are nested vector allocated in consecutive space - c++

For example, the following code creates 1000000 vector, each of them has length of 10.
After that we may sequentially scan the vector several times. If the 2nd-layer vectors are allocated in consecutive space( may few 2nd-layer vectors can fit in a cache block), the following access are efficient. But, if the 2nd-layer vectors are allocated in different places, each time we leave the inner loop we may jump to a random places to get the data, which is not efficient.
vector<vector<int > > a(1000000 , vector<int>(10))
for (int i = 0; i < a.size(); i++)
{
for (int j = 0; j< a[i].size() ; j++) {
a[i][j]++;
}
}
Furthermore, if the 2nd-layer vectors are allocated in consecutive space at first. After we push_back elements into vectors, they may be moved to other space due to the lack of space to extent them in-place. Will they still be kept in nearby?
Thank you.
EDIT1
Thanks, is there any implementation that put them together for improving performance of sequential scanning ?

vector<int> is just a small controller class, typically three words long. The actual managed dynamic memory is, well, allocated dynamically, so it is in essentially random locations. Your outer vector manages a contiguous range of inner vectors, but each inner vector manages an unrelated range of ints.
If you want contiguous storage, consider a single vector<int> of size 1000000 × 10 and access it in strides.

are nested vector allocated in consecutive space?
No! The space allocated from the inner vectors will not be guaranteed to be consecutive! Only the vector instances themselves will appear in a consecutive memory section managed by the outer vector.

Related

How to reserve a multi-dimensional Vector without increasing the vector size?

I have data which is N by 4 which I push back data as follows.
vector<vector<int>> a;
for(some loop){
...
a.push_back(vector<int>(4){val1,val2,val3,val4});
}
N would be less than 13000. In order to prevent unnecessary reallocation, I would like to reserve 13000 by 4 spaces in advance.
After reading multiple related posts on this topic (eg How to reserve a multi-dimensional Vector?), I know the following will do the work. But I would like to do it with reserve() or any similar function if there are any, to be able to use push_back().
vector<vector<int>> a(13000,vector<int>(4);
or
vector<vector<int>> a;
a.resize(13000,vector<int>(4));
How can I just reserve memory without increasing the vector size?
If your data is guaranteed to be N x 4, you do not want to use a std::vector<std::vector<int>>, but rather something like std::vector<std::array<int, 4>>.
Why?
It's the more semantically-accurate type - std::array is designed for fixed-width contiguous sequences of data. (It also opens up the potential for more performance optimizations by the compiler, although that depends on exactly what it is that you're writing.)
Your data will be laid out contiguously in memory, rather than every one of the different vectors allocating potentially disparate heap locations.
Having said that - #pasbi's answer is correct: You can use std::vector::reserve() to allocate space for your outer vector before inserting any actual elements (both for vectors-of-vectors and for vectors-of-arrays). Also, later on, you can use the std::vector::shrink_to_fit() method if you ended up inserting a lot less than you had planned.
Finally, one other option is to use a gsl::multispan and pre-allocate memory for it (GSL is the C++ Core Guidelines Support Library).
You've already answered your own question.
There is a function vector::reserve which does exactly what you want.
vector<vector<int>> a;
a.reserve(N);
for(some loop){
...
a.push_back(vector<int>(4){val1,val2,val3,val4});
}
This will reserve memory to fit N times vector<int>. Note that the actual size of the inner vector<int> is irrelevant at this point since the data of a vector is allocated somewhere else, only a pointer and some bookkeeping is stored in the actual std::vector-class.
Note: this answer is only here for completeness in case you ever come to have a similar problem with an unknown size; keeping a std::vector<std::array<int, 4>> in your case will do perfectly fine.
To pick up on einpoklum's answer, and in case you didn't find this earlier, it is almost always a bad idea to have nested std::vectors, because of the memory layout he spoke of. Each inner vector will allocate its own chunk of data, which won't (necessarily) be contiguous with the others, which will produce cache misses.
Preferably, either:
Like already said, use an std::array if you have a fixed and known amount of elements per vector;
Or flatten your data structure by having a single std::vector<T> of size N x M.
// Assuming N = 13000, M = 4
std::vector<int> vec;
vec.reserve(13000 * 4);
Then you can access it like so:
// Before:
int& element = vec[nIndex][mIndex];
// After:
int& element = vec[mIndex * 13000 + nIndex]; // Still assuming N = 13000

How Physically are Arrays Stored (Specifically with dimensions greater than 2)?

What I Know
I know that arrays int ary[] can be expressed in the equivalent "pointer-to" format: int* ary. However, what I would like to know is that if these two are the same, how physically are arrays stored?
I used to think that the elements are stored next to each other in the ram like so for the array ary:
int size = 5;
int* ary = new int[size];
for (int i = 0; i < size; i++) { ary[i] = i; }
This (I believe) is stored in RAM like: ...[0][1][2][3][4]...
This means we can subsequently replace ary[i] with *(ary + i) by just increment the pointers' location by the index.
The Issue
The issue comes in when I am to define a 2D array in the same way:
int width = 2, height = 2;
Vector** array2D = new Vector*[height]
for (int i = 0; i < width; i++) {
array2D[i] = new Vector[height];
for (int j = 0; j < height; j++) { array2D[i][j] = (i, j); }
}
Given the class Vector is for me to store both x, and y in a single fundamental unit: (x, y).
So how exactly would the above be stored?
It cannot logically be stored like ...[(0, 0)][(1, 0)][(0, 1)][(1, 1)]... as this would mean that the (1, 0)th element is the same as the (0, 1)th.
It cannot also be stored in a 2d array like below, as the physical RAM is a single 1d array of 8 bit numbers:
...[(0, 0)][(1, 0)]...
...[(0, 1)][(1, 1)]...
Neither can it be stored like ...[&(0, 0)][&(1, 0)][&(0, 1)][&(1, 1)]..., given &(x, y) is a pointer to the location of (x, y). This would just mean each memory location would just point to another one, and the value could not be stored anywhere.
Thank you in advanced.
What OP is struggling with a dynamically allocated array of pointers to dynamically allocated arrays. Each of these allocations is its own block of memory sitting somewhere in storage. There is no connection between them other than the logical connection established by the pointers in the outer array.
To try to visualize this say we make
int ** twodee;
twodee = new int*[4];
for (int i = 0; i < 4; i++)
{
twodee[i] = new int[4];
}
and then
int count = 1;
for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
{
twodee[i][j] = count++;
}
}
so we should wind up with twodee looking something like
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
right?
Logically, yes. But laid out in memory twodee might look something like this batsmurph crazy mess:
You can't really predict where your memory will be, you're at the mercy of the whatever memory manager handles the allocations and what already in storage where it might have been efficient for your memory to go. This makes laying dynamically-allocated multi-dimensional arrays out in your head almost a waste of time.
And there are a whole lot of things wrong with this when you get down into the guts of what a modern CPU can do for you. The CPU has to hop around a lot, and when it's hopping, it's ability to predict and preload the cache with memory you're likely to need in the near future is compromised. This means your gigahertz computer has to sit around and wait on your megahertz RAM a lot more than it should have to.
Try to avoid this whenever possible by allocating single, contiguous blocks of memory. You may pick up a bit of extra code mapping one dimensional memory over to other dimensions, but you don't lose any CPU time. C++ will have generated all of that mapping math for you as soon as you compiled [i][j] anyway.
The short answer to your question is: It is compiler dependent.
A more helpful answer (I hope) is that you can create 2D arrays that are layed out directly in memory, or you can create "2D arrays" that are actually 1D arrays, some with data, some with pointers to arrays.
There is a convention that the compiler is happy to generate the right kind of code to dereference and/or calculate the address of an element within an array when you use brackets to access an element in the array.
Generally arrays that are known to be 2D at compile time (eg int array2D[a][b]) will be layed out in memory without extra pointers and the compiler knows to multiply AND add to get an address each time there is an access. If your compiler isn't good at optimizing out the multiply, it makes repeated accesses much slower than they can be, so in the old days we often did pointer math ourselves to avoid the multiply if possible.
There is the issue that a compiler might optimize by rounding the lower dimension size up to a power of two, so a shift can be used instead of multiply, which would then require padding the locations (then even though they are all in one memory block, there are meaningless holes).
(Also, I'm pretty sure I've run into the problem that within a procedure, it needs to know which way the 2D array really is, so you may need to declare parameters in a way that lets the compiler know how to code the procedure, eg a[][] is different from *a[]). And obviously you can actually get the pointer from the array of pointers, if that is what you want--which isn't the same thing as the array it points too, of course.
In your code, you have clearly declared a full set of the lower dimension 1D arrays (inside the loop), and you have ALSO declared another 1D array of pointers you use to get to each one without a mulitply--instead by a dereference. So all those things will be in memory. Each 1D array will surely be sequentially layed out in a contiguous block of memory. It is just that it is entirely up to the memory manager as to where those 1D arrays are, relative to each other. (I doubt a compiler is smart enough to actually do the "new" ops at compile time, but it is theoretically possible, and would obviously affect/control the behavior if it did.)
Using the extra array of pointers clearly avoids the multiply ever and always. But it takes more space, and for sequential access actually makes the accesses slower and bigger (the extra dereference) versus maintaining a single pointer and one dereference.
Even if the 1D arrays DO end up contiguous sometimes, you might break it with another thread using the same memory manager, running a "new" while your "new" inside the loop is repeating.

Fast way to push_back a vector many times

I have identified a bottleneck in my c++ code, and my goal is to speed it up. I am moving items from one vector to another vector if a condition is true.
In python, the pythonic way of doing this would be to use a list comprehension:
my_vector = [x for x in data_vector if x > 1]
I have hacked a way to do this in C++, and it is working fine. However, I am calling this millions of times in a while-loop and it is slow. I do not understand much about memory allocation, but I assume that my problem has to do with allocating memory over-and-over again using push_back. Is there a way to allocate my memory differently to speed up this code? (I do not know how large my_vector should be until the for-loop has completed).
std::vector<float> data_vector;
// Put a bunch of floats into data_vector
std::vector<float> my_vector;
while (some_condition_is_true) {
my_vector.clear();
for (i = 0; i < data_vector.size(); i++) {
if (data_vector[i] > 1) {
my_vector.push_back(data_vector[i]);
}
}
// Use my_vector to render graphics on the GPU, but do not change the elements of my_vector
// Change the elements of data_vector, but not the size of data_vector
}
Use std::copy_if, and reserve data_vector.size() for my_vector initially (as this is the maximum possible number of elements for which your predicate could evaluate to true):
std::vector<int> my_vec;
my_vec.reserve(data_vec.size());
std::copy_if(data_vec.begin(), data_vec.end(), std::back_inserter(my_vec),
[](const auto& el) { return el > 1; });
Note that you could avoid the reserve call here if you expect that the number of times that your predicate evaluates to true will be much less than the size of the data_vector.
Though there are various great solutions posted by others for your query, it seems there is still no much explanation for the memory allocation, which you do not much understand, so I would like to share my knowledge about this topic with you. Hope this helps.
Firstly, in C++, there are several types of memory: stack, heap, data segment.
Stack is for local variables. There are some important features associated with it, for example, they will be automatically deallocated, operation on it is very fast, its size is OS-dependent and small such that storing some KB of data in the stack may cause an overflow of memory, et cetera.
Heap's memory can be accessed globally. As for its important features, we have, its size can be dynamically extended if needed and its size is larger(much larger than stack), operation on it is slower than stack, manual deallocation of memory is needed (in nowadays's OS, the memory will be automatically freed in the end of program), et cetera.
Data segment is for global and static variables. In fact, this piece of memory can be divided into even smaller parts, e.g. BBS.
In your case, vector is used. In fact, the elements of vector are stored into its internal dynamic array, that is an internal array with a dynamic array size. In the early C++, a dynamic array can be created on the stack memory, however, it is no longer that case. To create a dynamic array, ones have to create it on heap. Therefore, the elements of vector are stored in an internal dynamic array on heap. In fact, to dynamically increase the size of an array, a process namely memory reallocation is needed. However, if a vector user keeps enlarging his or her vector, then the overhead cost of reallocation cost will be high. To deal with it, a vector would firstly allocate a piece of memory that is larger than the current need, that is allocating memory for potential future use. Therefore, in your code, it is not that case that memory reallocation is performed every time push_back() is called. However, if the vector to be copied is quite large, the memory reserved for future use will be not enough. Then, memory allocation will occur. To tackle it, vector.reserve() may be used.
I am a newbie. Hopefully, I have not made any mistake in my sharing.
Hope this helps.
Run the code twice, first time only counting, how many new elements you will need. Then use reserve to already allocate all the memory you need.
while (some_condition_is_true) {
my_vector.clear();
int newLength = 0;
for (i = 0; i < data_vector.size(); i++) {
if (data_vector[i] > 1) {
newLength++;
my_vector.reserve(newLength);
for (i = 0; i < data_vector.size(); i++) {
if (data_vector[i] > 1) {
my_vector.push_back(data_vector[i]);
}
}
// Do stuff with my_vector and change data_vector
}
I doubt allocating my_vector is the problem, especially if the while loop is executed many times as the capacity of my_vector should quickly become sufficient.
But to be sure you can just reserve capacity in my_vector corresponding to the size of data_vector:
my_vector.reserve(data_vector.size());
while (some_condition_is_true) {
my_vector.clear();
for (auto value : data_vector) {
if (value > 1)
my_vector.push_back(value);
}
}
If you are on Linux you can reserve memory for my_vector to prevent std::vector reallocations which is bottleneck in your case. Note that reserve will not waste memory due to overcommit, so any rough upper estimate for reserve value will fit your needs. In your case the size of data_vector will be enough. This line of code before while loop should fix the bottleneck:
my_vector.reserve(data_vector.size());

Different addresses while filling a std::vector

Wouldn't you expect the addresses printed by the two loops to be the same? I was, and I cannot understand why (sometimes) they are different.
#include <iostream>
#include <vector>
using namespace std;
struct S {
void print_address() {
cout << this << endl;
}
};
int main(int argc,char *argv[]) {
vector<S> v;
for (size_t i = 0; i < 10; i++) {
v.push_back( S() );
v.back().print_address();
}
cout << endl;
for (size_t i = 0; i < v.size(); i++) {
v[i].print_address();
}
return 0;
}
I tested this code with many local and on-line compilers and the output I get looks like this (the last three figures are always the same):
0xaec010
0xaec031
0xaec012
0xaec013
0xaec034
0xaec035
0xaec036
0xaec037
0xaec018
0xaec019
0xaec010
0xaec011
0xaec012
0xaec013
0xaec014
0xaec015
0xaec016
0xaec017
0xaec018
0xaec019
I spotted this because making some initialization in the first loop I obtained uninitialized object in the subsequent part of the program. Am I missing something?
Because when vector capicity changes, it reallocates elements. If you std::vector::reserve enough capacity, no reallcation is needed, it will print same address.
vector<S> v;
v.reserve(10);
Note: properly use std::vector::reserve will increase application performance, because no unnecessary reallocation and objects copy.
The vector is performing re-allocations in order to grow as needed. Each time it does this, it allocates a larger buffer for the data and copies the elements across. You can see this clearly in the first loop, where each address jump is followed by a larger sequence of consecutive addresses. In the second loop, you just look at the addresses after the final reallocation.
0xaec010
0xaec031 <--
0xaec012 <--
0xaec013
0xaec034 <--
0xaec035
0xaec036
0xaec037
0xaec018 <--
0xaec019
The simplest way to instantiate a vector with 10 S objects would be
std::vector<S> v(10);
This would involve no re-allocations. See also std::vector::reserve.
Vector elements are stored contiguously; that is, they're all in a row in memory. Your vector object has to allocate space for this contiguous block of elements.
Your vector can't just keep having things added to it indefinitely. It has to grow the space it has allocated. The memory model typically doesn't allow us to expand a memory block — we have to create a new one instead. When the vector does this, it has to move all its elements to the new space. This is occurring several times within your first loop.
If you'd done:
vector<S> v;
v.reserve(10);
(which you can, since you know you'll end up with 10 elements), then no re-allocation would have been necessary, and the addresses would not have changed.
I'm not really surprised that they can change. As the vector initially has no size, it's likely to reallocate the vector once or twice during the initial loop. That'll change the base address of the vector. It's not impossible that after a resize, you'll end up using an address you used before (though I find that somewhat surprising. Are you sure about the first part of the addresses?)
If you want to ensure they don't change, you need to add a v.reserve() before you start pushing stuff on it.

Flattening a 3D array in c++ for use with MPI

Can anyone help with the general format for flattening a 3D array using MPI? I think I can get the array 1 dimensional just by using (i+xlength*j+xlength*ylength*k), but then I have trouble using equations that reference particular cells of the array.
I tried chunking the code into chunks based on how many processors I had, but then when I needed a value that another processor had, I had a hard time. Is there a way to make this easier (and more efficient) using ghost cells or pointer juggling?
You have two options at least. The simpler one is to declare a preprocessor macro that hides the complexity of the index calculation, e.g.:
#define ARR(A,i,j,k) A[(i)*ylength*zlength+(j)*zlength+(k)]
ARR(myarray,i,j,k) = ARR(myarray,i+1,j,k) + ARR(myarray,i,j+1,k) + ...
This is clumsy since the macro will only work with arrays of fixed leading dimensions, e.g. whatever x ylength x zlength.
Much better way to do it is to use so-called dope vectors. Dope vectors are basically indices into the big array. You allocate one big flat chunk of size xlength * ylength * zlength to hold the actual data and then create an index vector (actually a tree in the 3D case). In your case the index has two levels:
top level, consisting of xlength pointers to the
second level, consisting of xlength arrays of pointers, each containing ylength pointers to the beginning of a block of zlength elements in memory.
Let's call the top level pointer array A. Then A[i] is a pointer to a pointer array that describes the i-th slab of data. A[i][j] is the j-th element of the i-th pointer array, which points to data[i][j][0] (if data was a 3D array). Construction of the dope vector works similar to this:
double *data = new double[xlength*ylength*zlength];
double ***A;
A = new double**[xlength];
for (int i = 0; i < xlength; i++)
{
A[i] = new double*[ylength];
for (int j = 0; j < ylength; j++)
A[i][j] = data + i*ylength*zlength + j*zlength;
}
Dope vectors are as easy to use as normal arrays with some special considerations. For example, A[i][j][k] will give you access to the desired element of data. One caveat of dope vectors is that the top level consist of pointers to other pointer tables and not of pointers to the data itself, hence A cannot be used as shortcut for &A[0][0][0], nor A[i] used as shortcut for &A[i][0][0]. Still A[i][j] is equivalent to &A[i][j][0]. Another caveat is that this form of array indexing is slower than normal 3D array indexing since it involves pointer chasing.
Some people tend to allocate a single storage block for both data and dope vectors. They simply place the index at the beginning of the allocated block and the actual data goes after that. The advantage of this method is that disposing the array is as simple as deleting the whole memory block, while disposing dope vectors, created with the code from the previous section, requires multiple invocations of the free operator.