vector memory allocation strategy - c++

I wrote a little piece of code to determine how memory allocation in a vector is done.
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    vector<unsigned int> myvector;
    unsigned int capacity = myvector.capacity();
    for (unsigned int i = 0; i < 100000; ++i) {
        myvector.push_back(i);
        if (capacity != myvector.capacity())
        {
            capacity = myvector.capacity();
            cout << myvector.capacity() << endl;
        }
    }
    return 0;
}
I compiled this using Visual Studio 2008 and g++ 4.5.2 on Ubuntu and got these results:
Visual Studio:
1 2 3 4 6 9 13 19 28 42 63 94 141 211 316 474 711 1066 1599 2398 3597 5395 8092 12138 18207 27310 40965 61447 92170 138255
capacity = capacity * 1.5;
g++:
1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072
capacity = capacity * 2;
As you can see, these are two very different results.
Why is this the case? Does it depend only on the compiler, or on other factors as well?
Does it really make sense to keep doubling the capacity, even for large numbers of elements?

How the vector grows is implementation defined, so different strategies can be used, resulting in different capacities after inserting the same number of elements.
If you need to rely on how many items are allocated, you should use the reserve and/or resize methods of vector.
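For example, a minimal sketch (the element counts are illustrative): reserve only sets the capacity, while resize changes the size itself:
#include <cassert>
#include <vector>

int main()
{
    std::vector<unsigned int> v;

    v.reserve(100000);                // capacity is now at least 100000, size is still 0
    assert(v.capacity() >= 100000);
    assert(v.size() == 0);

    for (unsigned int i = 0; i < 100000; ++i)
        v.push_back(i);               // no reallocation happens during these pushes

    v.resize(200000);                 // size is now 200000; the new elements are value-initialized to 0
    assert(v.size() == 200000);
    return 0;
}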

As you can see, VS adds extra space in smaller chunks, while g++ grows by powers of 2. These are just different implementations of the same basic idea: the more elements you add, the more space will be allocated next time (because it is more likely that you will add additional data).
Imagine you've added 1 element to the vector, and I've added 1000. It's more likely that I will add another 1000, and less likely that you will. This is the reasoning behind such an allocation strategy.
The exact numbers surely depend on something, but that is the reasoning of the library implementers, since they can implement it in any way they want.
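As a toy illustration only (this is not the actual library code, just two assumed growth rules that happen to reproduce the sequences shown above):
#include <cstddef>
#include <iostream>

int main()
{
    // Rule A: grow by 50%, but never below the size actually needed (VS-like).
    // Rule B: double the capacity, starting from 1 (g++-like).
    std::size_t capA = 0, capB = 0;
    for (std::size_t size = 1; size <= 100000; ++size) {
        if (size > capA) {
            std::size_t grown = capA + capA / 2;
            capA = (grown < size) ? size : grown;
            std::cout << "A: " << capA << '\n';
        }
        if (size > capB) {
            capB = (capB == 0) ? 1 : capB * 2;
            std::cout << "B: " << capB << '\n';
        }
    }
    return 0;
}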

The standard only defines a vector's behaviour. What really happens internally depends on the implementation.
Growing the capacity geometrically (e.g. doubling) results in an amortized O(n) cost for pushing/popping n elements, i.e. amortized constant time per push_back, which is required for a vector, I guess.
Look here for more details.

Related

Access efficiency of C++ 2D array

I have a 2D array a1[10000][100] with 10000 rows and 100 columns, and also a 2D array a2[100][10000] which is the transposed matrix of a1.
Now I need to access 2 columns (e.g. the 21st and the 71st columns) of a1 in the order a1[0][20], a1[0][70], a1[1][20], a1[1][70], ..., a1[9999][20], a1[9999][70]. Or I can access a2 to achieve the same goal (in the order a2[20][0], a2[70][0], a2[20][1], a2[70][1], ..., a2[20][9999], a2[70][9999]). But the latter is much faster than the former. The related code is simplified as follows (size1 = 10000):
1 sum1 = 0;
2 for (i = 0; i < size1; ++i) {
3 x = a1[i][20];
4 y = a1[i][70];
5 sum1 = x + y;
6 } // this loop is slower
7 a_sum1[i] = sum1;
8
9 sum2 = 0;
10 for (i = 0; i < size1; ++i) {
11 x = a2[20][i];
12 y = a2[70][i];
14 sum2 = x + y;
15 } // this loop is faster
16 a_sum2[i] = sum2;
Accessing more rows (I have also tried 3 or 4 rows rather than the 2 rows in the example above) of a2 is also faster than accessing the same number of columns of a1. Of course I can also replace Lines 3-5 (or Lines 11-14) with a loop (by using an extra array to store the column/row indexes to be accessed); it gives the same result: the latter is faster than the former.
Why is the latter much faster than the former? I know something about cache lines but I have no idea of the reason for this case. Thanks.
You can benefit from the memory cache if you access addresses within the same cache line in a short amount of time. The explanation below assumes your arrays contain 4-byte integers.
In your first loop, your two memory accesses in the loop are 50*4 bytes apart, and the next iteration jumps forward 400 bytes. Every memory access here is a cache miss.
In the second loop, the two memory accesses are still far apart (50 rows, i.e. 50*10000*4 bytes), but on the next loop iteration you access addresses that are right next to the previously fetched values. Assuming the common 64-byte cache line size, you only have two cache misses every 16 iterations of the loop; the rest can be served from the two cache lines loaded at the start of such a cycle.
This is because C++ has a row-major order (https://en.wikipedia.org/wiki/Row-_and_column-major_order). You should avoid column-major access in C/C++ (https://www.appentra.com/knowledge/checks/pwr010/).
The reason is that the elements are stored row by row, and accessing by rows makes better use of cache lines, vectorization and other hardware features/techniques.
The reason is cache locality.
a2[20][0], a2[20][1], a2[20][2] ... are stored in memory next to each other. And a1[0][20], a1[1][20], a1[2][20] ... aren't (the same applies to a2[70][0], a2[70][1], a2[70][2] ...).
That means that accessing a1[0][20], a1[1][20], a1[2][20] would waste DRAM bandwidth, as it would use only 4 or 8 bytes of each 64-byte cache line loaded from DRAM.
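If you want to see the effect in isolation, here is a rough benchmark sketch (the array size, the flat indexing and the use of std::chrono are my own choices, not taken from the question); the row-major walk touches consecutive addresses and is usually several times faster:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t rows = 4096, cols = 4096;
    std::vector<int> a(rows * cols, 1);   // one flat block, element (r, c) at a[r * cols + c]

    using clock = std::chrono::steady_clock;
    long long sum1 = 0, sum2 = 0;

    auto t0 = clock::now();
    for (std::size_t c = 0; c < cols; ++c)        // column-major walk: stride of cols elements
        for (std::size_t r = 0; r < rows; ++r)
            sum1 += a[r * cols + c];
    auto t1 = clock::now();

    for (std::size_t r = 0; r < rows; ++r)        // row-major walk: consecutive addresses
        for (std::size_t c = 0; c < cols; ++c)
            sum2 += a[r * cols + c];
    auto t2 = clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "column-major: " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n"
              << "row-major:    " << std::chrono::duration_cast<ms>(t2 - t1).count() << " ms\n"
              << "(checksums: " << sum1 << ", " << sum2 << ")\n";
    return 0;
}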

Concurrent bruteforce and memory overflow : the best of both worlds?

I'm working on a bruteforce algorithm for solving a kind of Puzzle.
That puzzle is a rectangle, and for some reasons irrelevant here, the number of possible solutions of a rectangle whose size is width*height is 2^(min(width, height)) instead of 2^(width*height).
Both dimensions can be considered as in range 1..50. (most often below 30 though)
This way, the number of solutions is, at worst, 2^50 (about 1,000,000,000,000,000). I store a solution as an unsigned 64-bit number, a kind of "seed".
I have two working algorithms for brute-force solving.
Assume N is min(width, height) and isCorrect(uint64_t) is a predicate that returns whether the solution with the given seed is correct or not.
The most naive algorithm is roughly this :
vector<uint64_t> solutions;
for (uint64_t i = 0; i < (uint64_t(1) << N); ++i)   // note: plain 1 << N would overflow int for large N
{
    if (isCorrect(i))
        solutions.push_back(i);
}
It works perfectly (assuming the predicate is actually implemented :D) but does not benefit from multiple cores, so I'd like to have a multi-threaded approach.
I've come across QtConcurrent, which provides concurrent filter and map functions that automatically create an optimal number of threads to share the burden.
So I have a new algorithm that is roughly this:
vector<uint64_t> solutionsToTry;
solutionsToTry.reserve(uint64_t(1) << N);
for (uint64_t i = 0; i < (uint64_t(1) << N); ++i)
    solutionsToTry.push_back(i);

// Now, filtering
QFuture<uint64_t> solutions = QtConcurrent::filtered(solutionsToTry, &isCorrect);
It does work too, and is a bit faster, but when N goes higher than 20 there's simply not enough room in my RAM to allocate the vector (with N = 20 and 64-bit numbers I need 8.3 GB of RAM; it's okay with swap partitions etc., but since it gets multiplied by 2 every time N increases by 1, it can't go further).
Is there a simple way to have concurrent filtering without bloating memory ?
If there isn't, I might rather hand-split the loops across 4 threads to get concurrency without optimal sizing, or write the algorithm in Haskell to get lazy evaluation and filtering of infinite lists :-)
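Without Qt, one way to keep memory flat is to let each thread walk its own strided slice of the seed range and only store the hits. This is just a sketch under my own assumptions (thread layout and names are mine); isCorrect is the question's predicate and is assumed to be thread-safe:
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

std::vector<std::uint64_t> bruteForce(unsigned N, bool (*isCorrect)(std::uint64_t))
{
    const std::uint64_t total = std::uint64_t(1) << N;   // 1ULL << N, not 1 << N
    const unsigned threadCount = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::uint64_t> solutions;
    std::mutex solutionsMutex;
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&, t] {
            std::vector<std::uint64_t> local;            // per-thread hits, merged once at the end
            for (std::uint64_t i = t; i < total; i += threadCount)
                if (isCorrect(i))
                    local.push_back(i);
            std::lock_guard<std::mutex> lock(solutionsMutex);
            solutions.insert(solutions.end(), local.begin(), local.end());
        });
    }
    for (auto& w : workers)
        w.join();
    return solutions;
}

// Usage: auto sols = bruteForce(20, &isCorrect);
Note that the merged result is no longer sorted by seed; sort it afterwards if the order matters.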

C++ vector performance with predefined capacity

There are two ways to define a std::vector (that I know of):
std::vector<int> vectorOne;
and
std::vector<int> vectorTwo(300);
So if I take the first one, without defining a size, and fill it with 300 ints, it has to reallocate memory along the way to store them. That would mean the elements would not, for example, occupy addresses 0x0 through 0x300; there could be other memory allocated in between, because the vector has to be reallocated later. The second vector, however, already has those addresses reserved for it, so there would be no space in between.
Does this affect performance at all, and how could I measure it?
std::vector is guaranteed to always store its data in a continuous block of memory. That means that when you add items, it has to try and increase its range of memory in use. If something else is in the memory following the vector, it needs to find a free block of the right size somewhere else in memory and copy all the old data + the new data to it.
This is a fairly expensive operation in terms of time, so it tries to mitigate by allocating a slightly larger block than what you need. This allows you to add several items before the whole reallocate-and-move-operation takes place.
Vector has two properties: size and capacity. The former is how many elements it actually holds, the latter is how many places are reserved in total. For example, if you have a vector with size() == 10 and capacity() == 18, it means you can add 8 more elements before it needs to reallocate.
How and when exactly the capacity increases is up to the implementer of your standard library version. You can test what happens on your computer with the following test:
#include <iostream>
#include <vector>

int main() {
    using std::cout;
    using std::vector;

    // Create a vector with 10 default-initialized (zero) elements
    vector<int> v(10);
    cout << "v has size " << v.size() << " and capacity " << v.capacity() << "\n";

    // Now add 90 values, and print the size and capacity after each insert
    for (int i = 11; i <= 100; ++i)
    {
        v.push_back(i);
        cout << "v has size " << v.size() << " and capacity " << v.capacity()
             << ". Memory range: " << &v.front() << " -- " << &v.back() << "\n";
    }
    return 0;
}
I ran it on IDEOne and got the following output:
v has size 10 and capacity 10
v has size 11 and capacity 20. Memory range: 0x9899a40 -- 0x9899a68
v has size 12 and capacity 20. Memory range: 0x9899a40 -- 0x9899a6c
v has size 13 and capacity 20. Memory range: 0x9899a40 -- 0x9899a70
...
v has size 20 and capacity 20. Memory range: 0x9899a40 -- 0x9899a8c
v has size 21 and capacity 40. Memory range: 0x9899a98 -- 0x9899ae8
...
v has size 40 and capacity 40. Memory range: 0x9899a98 -- 0x9899b34
v has size 41 and capacity 80. Memory range: 0x9899b40 -- 0x9899be0
You see the capacity increase and re-allocations happening right there, and you also see that this particular compiler chooses to double the capacity every time you hit the limit.
On some systems the algorithm will be more subtle, growing faster as you insert more items (so if your vector is small, you waste little space, but if it notices you insert a lot of items into it, it allocates more to avoid having to increase the capacity too often).
PS: Note the difference between setting the size and the capacity of a vector.
vector<int> v(10);
will create a vector with capacity at least 10, and size() == 10. If you print the contents of v, you will see that it contains
0 0 0 0 0 0 0 0 0 0
i.e. 10 integers with their default values. The next element you push into it may (and likely will) cause a reallocation. On the other hand,
vector<int> v;
v.reserve(10);
will create an empty vector, but with its initial capacity set to 10 rather than the default (typically 0). You can be certain that the first 10 elements you push into it will not cause a reallocation (and the 11th probably will, but not necessarily, as reserve may actually set the capacity to more than what you requested).
You should use reserve() method:
std::vector<int> vec;
vec.reserve(300);
assert(vec.size() == 0); // But memory is allocated
This solves the problem.
In your example it affects the performance greatly. You can expect that, when you exceed the vector's capacity, it roughly doubles the allocated memory. So if you push_back() into the vector N times (and you haven't called reserve()), you can expect O(log N) reallocations, each of them copying all the values present at that moment. Because the copied sizes form a geometric series, the total cost is amortized O(N), i.e. amortized constant time per push_back; the exact growth factor is not specified by the C++ standard.
The difference can be dramatic because, if data is not adjacent in memory, it may have to be fetched from main memory, which is roughly 200 times slower than an L1 cache fetch. This will not happen within a vector, because the data in a vector is required to be adjacent.
see https://www.youtube.com/watch?v=YQs6IC-vgmo
Use std::vector::reserve when you can, to avoid reallocation events. The C++ <chrono> header has good time utilities to measure the time difference in high-resolution ticks.
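For example, a rough measurement sketch along those lines (the element count and the use of milliseconds are my own choices):
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t n = 10000000;
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::milliseconds;

    auto t0 = clock::now();
    std::vector<int> a;                 // grows on demand: several reallocations and copies
    for (std::size_t i = 0; i < n; ++i)
        a.push_back(static_cast<int>(i));
    auto t1 = clock::now();

    std::vector<int> b;
    b.reserve(n);                       // one allocation up front, no reallocation in the loop
    for (std::size_t i = 0; i < n; ++i)
        b.push_back(static_cast<int>(i));
    auto t2 = clock::now();

    std::cout << "without reserve: " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n"
              << "with reserve:    " << std::chrono::duration_cast<ms>(t2 - t1).count() << " ms\n";
    return 0;
}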

Calculating size of vector of vectors in bytes

typedef vector<vector<short>> Mshort;
typedef vector<vector<int>> Mint;
Mshort mshort(1 << 20, vector<short>(20, -1)); // Xcode shows 73MB
Mint mint(1 << 20, vector<int>(20, -1)); // Xcode shows 105MB
short uses 2 bytes and int 4 bytes; please note that 1 << 20 = 2^20;
I am trying to calculate the memory usage ahead of time (on paper), but I am unable to.
sizeof(vector<>) // = 24 //no matter what type
sizeof(int) // = 4
sizeof(short) // = 2
I do not understand: mint should be double the size of mshort, but it isn't. When running the program with only the mshort initialisation, Xcode shows 73 MB of memory usage; for mint, 105 MB.
mshort.size() * mshort[0].size() * sizeof(short) * sizeof(vector<short>) // = 1006632960
mint.size() * mint[0].size() * sizeof(int) * sizeof(vector<int>) // = 2013265920
//no need to use .capacity() because I fill vectors with -1
1006632960 * 2 = 2013265920
How does one calculate how much RAM a 2D std::vector or a 2D std::array will use?
I know the sizes ahead of time, and each row has the same number of columns.
The memory usage of your vectors of vectors will be e.g.
// the size of the data...
mshort.size() * mshort[0].size() * sizeof(short) +
// the size of the inner vector objects...
mshort.size() * sizeof mshort[0] +
// the size of the outer vector object...
// (this is ostensibly on the stack, given your code)
sizeof mshort +
// dynamic allocation overheads
overheads
The dynamic allocation overheads are because the vectors internally new memory for the elements they're to store, and for speed reasons they may have pools of fixed-sized memory areas waiting for new requests, so if the vector effectively does a new short[20] - with the data needing 40 bytes - it might end up with e.g. 48 or 64. The implementation may actually need to use some extra memory to store the array size, though for short and int there's no need to loop over the elements invoking destructors during delete[], so a good implementation will avoid that allocation and no-op destruction behaviour.
The actual data elements for any given vector are contiguous in memory though, so if you want to reduce the overheads, you can change your code to use fewer, larger vectors. For example, using one vector with (1 << 20) * 20 elements will have negligible overhead - then rather than accessing [i][j] you access [i * 20 + j] - you can write a simple class wrapping the vector to do this for you, most simply with a v(i, j) notation...
inline short& operator()(size_t i, size_t j) { return v_[i * 20 + j]; }
inline short operator()(size_t i, size_t j) const { return v_[i * 20 + j]; }
...though you could support v[i][j] by having v.operator[] return a proxy object that can be further indexed with []. I'm sure if you search SO for questions on multi-dimension arrays there'll be some examples - think I may have posted such code myself once.
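A minimal sketch of such a flat-storage wrapper (the class name and constructor parameters are illustrative):
#include <cstddef>
#include <vector>

class ShortMatrix
{
public:
    ShortMatrix(std::size_t rows, std::size_t cols, short initial = -1)
        : cols_(cols), v_(rows * cols, initial) {}

    short& operator()(std::size_t i, std::size_t j)       { return v_[i * cols_ + j]; }
    short  operator()(std::size_t i, std::size_t j) const { return v_[i * cols_ + j]; }

private:
    std::size_t cols_;
    std::vector<short> v_;
};

// Usage: ShortMatrix m(1 << 20, 20);  m(5, 3) = 42;
// Only one dynamic allocation is made, so the per-row vector objects and the
// per-row allocation overheads discussed above disappear.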
The main reason to want vector<vector<x>> is when the inner vectors vary in length.
Assuming glibc malloc:
Each memory chunk will carry an additional 8-16 bytes (2 size_t) for the memory block header. For a 64-bit system it is 16 bytes.
see code:
https://github.com/sploitfun/lsploits/blob/master/glibc/malloc/malloc.c#L1110
chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |             Size of previous chunk, if allocated            | |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |             Size of chunk, in bytes                       |M|P|
  mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |             User data starts here...                          .
        .                                                               .
        .             (malloc_usable_size() bytes)                      .
        .                                                               |
It gives me approximately 83886080 for short when adding 16 bytes per row:
24 + 16 + mshort.size() (1048576) * (mshort[0].size() (20) * sizeof(short) (2) + sizeof(vector) (24) + header (16))
It gives me approximately 125829120 for int.
But then I recomputed your numbers and it looks like you are on 32-bit:
short: 75497472, that is ~73M
long: 117440512, that is ~112M
That looks very close to the reported numbers.
Use capacity() rather than size() to get the number of allocated items, even if those are the same in your case.
Allocating a single vector of size rows*columns will save you header*1048576 bytes.
Your calculation mshort.size() * mshort[0].size() * sizeof(short) * sizeof(vector<short>) // = 1006632960 is simply wrong. According to it, mshort would take 1006632960 bytes, which is 960 MiB; that is not true.
Let's ignore libc's overhead, and just focus on std::vector<>'s size:
mshort is a vector of 2^20 items, each of which is a vector<short> with 20 items.
So the size shall be:
mshort.size() * mshort[0].size() * sizeof(short) // Size of all short values
+ mshort.size() * sizeof(vector<short>) // Size of 2^20 vector<short> objects
+ sizeof(mshort) // Size of mshort itself, which can be ignored as overhead
The calculated size is 64MiB.
The same goes for mint, where the calculated size is 104MiB.
So mint is simply NOT double the size of mshort.
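If you want the machine to do the arithmetic, here is a small sketch that prints the same estimate (it assumes a typical 64-bit implementation where sizeof(std::vector<T>) is 24, and it ignores allocator overhead):
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t rows = 1 << 20, cols = 20;

    const std::size_t shortBytes =
          rows * cols * sizeof(short)                       // the short values themselves
        + rows * sizeof(std::vector<short>)                 // the 2^20 inner vector objects
        + sizeof(std::vector<std::vector<short>>);          // the outer vector object
    const std::size_t intBytes =
          rows * cols * sizeof(int)
        + rows * sizeof(std::vector<int>)
        + sizeof(std::vector<std::vector<int>>);

    std::cout << "estimated mshort: " << shortBytes / (1024.0 * 1024.0) << " MiB\n"
              << "estimated mint:   " << intBytes / (1024.0 * 1024.0) << " MiB\n";
    return 0;
}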

out of memory and vector of vectors

I'm implementing a distance matrix that calculates the distance between each point and all the other points, and I have 100,000 points, so my matrix size will be 100,000 x 100,000. I implemented it using vector<vector<double> > dist. However, for this large data size it gives an out-of-memory error. The following is my code and any help will be really appreciated.
vector<vector<double> > dist(dat.size(), vector<double>(dat.size()));
size_t p, j;
ptrdiff_t i;
#pragma omp parallel for private(p, j, i) default(shared)
for (p = 0; p < dat.size(); ++p)
{
    // #pragma omp parallel for private(j, i) default(shared)
    for (j = p + 1; j < dat.size(); ++j)
    {
        double ecl = 0.0;
        for (i = 0; i < c; ++i)   // c = number of coordinates per point
        {
            ecl += (dat[p][i] - dat[j][i]) * (dat[p][i] - dat[j][i]);
        }
        ecl = sqrt(ecl);
        dist[p][j] = ecl;
        dist[j][p] = ecl;
    }
}
A 100000 x 100000 matrix? A quick calculation shows why this is never going to work:
100000 x 100000 x 8 (bytes) / (1024 * 1024 * 1024) = 74.5 gigabytes...
Even if it was possible to allocate this much memory I doubt very much whether this would be an efficient approach for a real problem.
If you're looking to do some kind of geometric processing on large data sets you may be interested in some kind of spatial tree structure: kd-trees, quadtrees, r-trees maybe?
100,000 * 100,000 = 10,000,000,000 ~= 2^33
It is easy to see that on a 32-bit system, running out of memory is guaranteed for such a large data set, and that is before even accounting for the fact that this figure is the number of elements, not the number of bytes used.
Even on 64-bit systems it is highly unlikely that the OS will let you allocate so much memory [also note that you actually need much more memory, because each element you allocate is much more than a byte].
Did you know that 100,000 times 100,000 is 10 billion? If you're storing the distances as 32-bit integers, that would be 40 billion bytes, or 37.5 GB. That is probably more RAM than you have, so this will not be feasible.
100,000 x 100,000 x sizeof(double) = roughly 80 GB (with 8-byte doubles), without the overhead of the vectors.
That's not likely to happen unless you're on a really big machine.
Look at using a database of some sort, or one of the C/C++ collection libraries that spill large data to disk.
Rogue Wave's SourcePRO class library has a few disk based collection classes but it is not free.
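Since the matrix in the question is symmetric (dist[p][j] == dist[j][p]) with a zero diagonal, one small mitigation is to store only the upper triangle in a single flat vector, which halves the memory. A sketch with illustrative names follows; for 100,000 points this is still roughly 37 GB of doubles, so it only softens the problem rather than solving it:
#include <cstddef>
#include <vector>

class UpperTriangle
{
public:
    explicit UpperTriangle(std::size_t n)
        : n_(n), data_(n * (n - 1) / 2, 0.0) {}

    // Distance between points p and j; requires p < j.
    double& at(std::size_t p, std::size_t j)
    {
        // Row p starts after rows 0..p-1, which hold (n-1) + (n-2) + ... entries.
        return data_[p * n_ - p * (p + 1) / 2 + (j - p - 1)];
    }

private:
    std::size_t n_;
    std::vector<double> data_;
};

// Usage: UpperTriangle dist(dat.size());  dist.at(p, j) = ecl;  // for p < j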