1D or 2D array, what's faster? - c++

I'm in need of representing a 2D field (axes x, y) and I face a problem: should I use a 1D array or a 2D array?
I can imagine that recalculating indices for 1D arrays (y + x*n) could be slower than using a 2D array (x, y), but I could also imagine that 1D could be in the CPU cache.
I did some googling, but only found pages regarding static arrays (and stating that 1D and 2D are basically the same). But my arrays must be dynamic.
So, what's
1. faster,
2. smaller (RAM):
dynamic 1D arrays or dynamic 2D arrays?

tl;dr : You should probably use a one-dimensional approach.
Note: One cannot dig into every detail affecting performance when comparing dynamic 1D and dynamic 2D storage patterns without filling books, since the performance of code depends on a very large number of parameters. Profile if possible.
1. What's faster?
For dense matrices the 1D approach is likely to be faster since it offers better memory locality and less allocation and deallocation overhead.
2. What's smaller?
Dynamic-1D consumes less memory than the 2D approach. The latter also requires more allocations.
Remarks
I've laid out a pretty long answer below with several reasons, but first I want to make some remarks on your assumptions.
I can imagine that recalculating indices for 1D arrays (y + x*n) could be slower than using a 2D array (x, y)
Let's compare these two functions:
int get_2d (int **p, int r, int c) { return p[r][c]; }
int get_1d (int *p, int r, int c) { return p[c + C*r]; } // C is the (compile-time constant) number of columns
The (non-inlined) assembly generated by Visual Studio 2015 RC for those functions (with optimizations turned on) is:
?get_1d@@YAHPAHII@Z PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _c$[ebp]
lea eax, DWORD PTR [eax+edx*4]
mov eax, DWORD PTR [ecx+eax*4]
pop ebp
ret 0
?get_2d@@YAHPAPAHII@Z PROC
push ebp
mov ebp, esp
mov ecx, DWORD PTR [ecx+edx*4]
mov eax, DWORD PTR _c$[ebp]
mov eax, DWORD PTR [ecx+eax*4]
pop ebp
ret 0
The difference is mov (2d) vs. lea (1d).
The former has a latency of 3 cycles and a maximum throughput of 2 per cycle, while the latter has a latency of 2 cycles and a maximum throughput of 3 per cycle (according to Instruction tables by Agner Fog).
Since the differences are minor, I don't think a big performance difference will arise from the index recalculation. I expect it to be very unlikely that this difference would itself be the bottleneck in any program.
This brings us to the next (and more interesting) point:
... but I could imagine that 1D could be in the CPU cache ...
True, but 2D could be in the CPU cache, too. See The Downsides: Memory locality for an explanation of why 1D is still better.
The long answer, or why dynamic 2-dimensional data storage (pointer-to-pointer or vector-of-vector) is "bad" for simple / small matrices.
Note: This is about dynamic arrays/allocation schemes [malloc/new/vector etc.]. A static 2d array is a contiguous block of memory and therefore not subject to the downsides I'm going to present here.
The Problem
To understand why a dynamic array of dynamic arrays or a vector of vectors is most likely not the data storage pattern of choice, you need to understand the memory layout of such structures.
Example case using pointer-to-pointer syntax
int main (void)
{
    // allocate memory for 4x4 integers; quick & dirty
    int ** p = new int*[4];
    for (size_t i=0; i<4; ++i) p[i] = new int[4];

    // do some stuff here, using p[x][y]

    // deallocate memory
    for (size_t i=0; i<4; ++i) delete[] p[i];
    delete[] p;
}
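For contrast, here's a sketch of the same matrix backed by a single contiguous allocation; indexing via r*cols + c replaces p[r][c]:

#include <cstddef>

int main ()
{
    // one contiguous block for all 16 integers
    std::size_t const rows = 4, cols = 4;
    int * p = new int[rows * cols];

    // do some stuff here, using p[r*cols + c]

    // a single deallocation suffices
    delete[] p;
}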
The downsides
Memory locality
For this “matrix” you allocate one block of four pointers and four blocks of four integers. All of the allocations are unrelated and can therefore end up at arbitrary memory positions.
The following description gives you an idea of what the memory may look like:
For the real 2D case:
The violet square is the memory position occupied by p itself.
The green squares make up the memory region p points to (4 x int*).
The 4 regions of 4 contiguous blue squares are the ones pointed to by each int* of the green region.
For the 2D mapped onto 1D case:
The green square is the only required pointer, an int*.
The blue squares make up the memory region for all matrix elements (16 x int).
This means that (when using the pointer-to-pointer layout) you will probably observe worse performance than for a contiguous storage pattern (the mapped-onto-1D layout), due to caching for instance.
Let's say a cache line is "the amount of data transferred into the cache at once" and let's imagine a program accessing the whole matrix one element after another.
If you have a properly aligned 4×4 matrix of 32-bit values, a processor with a 64-byte cache line (a typical value) is able to "one-shot" the data (4*4*4 = 64 bytes).
If you start processing and the data isn't already in the cache you'll face a cache miss and the data will be fetched from main memory. This load can fetch the whole matrix at once since it fits into a cache line, if and only if it is contiguously stored (and properly aligned).
There will probably not be any more misses while processing that data.
In case of a dynamic, "real two-dimensional" system with unrelated locations of each row/column, the processor needs to load every memory location separately.
Even though only 64 bytes are required, loading 4 cache lines for 4 unrelated memory positions would, in a worst-case scenario, actually transfer 256 bytes and waste 75% of the bandwidth.
If you process the data using the 2d-scheme you'll again (if not already cached) face a cache miss on the first element.
But now, only the first row/column will be in the cache after the first load from main memory, because all other rows are located somewhere else in memory and not adjacent to the first one.
As soon as you reach a new row/column there will again be a cache miss and the next load from main memory is performed.
Long story short: The 2D pattern has a higher chance of cache misses, while the 1D scheme offers better potential for performance due to the locality of the data.
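If you want to observe this yourself, a minimal (and unscientific) benchmark skeleton might look like the following; the exact numbers are entirely machine and compiler dependent:

#include <chrono>
#include <cstddef>
#include <iostream>

int main ()
{
    std::size_t const N = 1024;

    // contiguous 1d storage
    int * flat = new int[N * N]();

    // pointer-to-pointer 2d storage
    int ** nested = new int*[N];
    for (std::size_t i = 0; i < N; ++i) nested[i] = new int[N]();

    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    long long sum1 = 0;
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            sum1 += flat[r * N + c];
    auto t1 = clock::now();

    long long sum2 = 0;
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            sum2 += nested[r][c];
    auto t2 = clock::now();

    // printing the sums keeps the optimizer from removing the loops
    std::cout << "1d: " << std::chrono::duration<double>(t1 - t0).count() << " s, "
              << "2d: " << std::chrono::duration<double>(t2 - t1).count() << " s "
              << "(" << sum1 << ", " << sum2 << ")\n";

    for (std::size_t i = 0; i < N; ++i) delete[] nested[i];
    delete[] nested;
    delete[] flat;
}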
Frequent Allocation / Deallocation
As many as N + 1 (4 + 1 = 5) allocations (using either new, malloc, allocator::allocate or whatever) are necessary to create the desired NxM (4×4) matrix.
The same number of matching deallocation operations must be performed as well.
Therefore, it is more costly to create/copy such matrices in contrast to a single allocation scheme.
This is getting even worse with a growing number of rows.
Memory consumption overhead
I'll assume a size of 32 bits for int and 32 bits for pointers. (Note: this is system dependent.)
Let's remember: We want to store a 4×4 int matrix which means 64 bytes.
For an NxM matrix, stored with the presented pointer-to-pointer scheme, we consume
N*M*sizeof(int) [the actual blue data] +
N*sizeof(int*) [the green pointers] +
sizeof(int**) [the violet variable p] bytes.
That makes 4*4*4 + 4*4 + 4 = 84 bytes in the present example, and it gets even worse when using std::vector<std::vector<int>>.
It will require N * M * sizeof(int) + N * sizeof(vector<int>) + sizeof(vector<vector<int>>) bytes, that is 4*4*4 + 4*16 + 16 = 144 bytes in total, instead of 64 bytes for a 4×4 int matrix.
In addition, depending on the allocator used, each single allocation may well (and most likely will) have another 16 bytes of memory overhead (some "info bytes" which store the number of allocated bytes for the purpose of proper deallocation).
This means the worst case is:
N*(16+M*sizeof(int)) + 16+N*sizeof(int*) + sizeof(int**)
= 4*(16+4*4) + 16+4*4 + 4 = 164 bytes! (Overhead: 156%)
The share of the overhead will reduce as the size of the matrix grows but will still be present.
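You can compute the corresponding (non-worst-case) numbers for your own platform with a few sizeof expressions; a sketch, where the 16 bytes of per-allocation overhead remain an assumption about the allocator and are not included:

#include <cstddef>
#include <iostream>
#include <vector>

int main ()
{
    std::size_t const N = 4, M = 4;

    std::size_t const payload = N * M * sizeof(int);
    std::size_t const ptr2ptr = payload + N * sizeof(int*) + sizeof(int**);
    std::size_t const vecvec  = payload + N * sizeof(std::vector<int>)
                              + sizeof(std::vector<std::vector<int>>);

    std::cout << "payload:            " << payload << " bytes\n";
    std::cout << "pointer-to-pointer: " << ptr2ptr << " bytes\n";
    std::cout << "vector-of-vector:   " << vecvec  << " bytes\n";
}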
Risk of memory leaks
The bunch of allocations requires appropriate exception handling in order to avoid memory leaks if one of the allocations fails!
You’ll need to keep track of allocated memory blocks and you must not forget them when deallocating the memory.
If new runs out of memory and the next row cannot be allocated (especially likely when the matrix is very large), a std::bad_alloc is thrown by new.
Example:
In the above-mentioned new/delete example, we need some more code if we want to avoid leaks in the case of bad_alloc exceptions.
// allocate memory for 4x4 integers; quick & dirty
size_t const N = 4;

// we don't need try for this allocation
// if it fails there is no leak
int ** p = new int*[N];

size_t allocs(0U);
try
{ // try block doing further allocations
    for (size_t i=0; i<N; ++i)
    {
        p[i] = new int[4]; // allocate
        ++allocs;          // advance counter if no exception occurred
    }
}
catch (std::bad_alloc & be)
{ // if an exception occurs we need to free our memory
    for (size_t i=0; i<allocs; ++i) delete[] p[i]; // free all allocated p[i]s
    delete[] p; // free p
    throw;      // rethrow bad_alloc
}

/*
do some stuff here, using p[x][y]
*/

// deallocate memory according to the number of allocations
for (size_t i=0; i<allocs; ++i) delete[] p[i];
delete[] p;
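Note that all of this bookkeeping disappears if the rows are owned by RAII types. A sketch using std::vector, which fixes only the leak problem (the layout is still non-contiguous):

#include <cstddef>
#include <vector>

int main ()
{
    std::size_t const N = 4;
    // each row is owned by a vector; if any allocation throws,
    // all already-constructed rows are destroyed automatically
    std::vector<std::vector<int>> p(N, std::vector<int>(4));

    // do some stuff here, using p[x][y]

} // no manual deallocation required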
Summary
There are cases where "real 2D" memory layouts fit and make sense (e.g. if the number of columns per row is not constant), but in the most simple and common 2D data storage cases they just bloat the complexity of your code and reduce the performance and memory efficiency of your program.
Alternative
You should use a contiguous block of memory and map your rows onto that block.
The "C++ way" of doing it is probably to write a class that manages your memory while considering important things like
What is The Rule of Three?
What is meant by Resource Acquisition is Initialization (RAII)?
C++ concept: Container (on cppreference.com)
Example
To provide an idea of what such a class may look like, here's a simple example with some basic features:
2d-size-constructible
2d-resizable
operator()(size_t, size_t) for 2D row-major element access
at(size_t, size_t) for checked 2D row-major element access
Fulfills Concept requirements for Container
Source:
#include <vector>
#include <algorithm>
#include <iterator>
#include <utility>

namespace matrices
{
    template<class T>
    class simple
    {
    public:
        // misc types
        using data_type = std::vector<T>;
        using value_type = typename std::vector<T>::value_type;
        using size_type = typename std::vector<T>::size_type;
        // ref
        using reference = typename std::vector<T>::reference;
        using const_reference = typename std::vector<T>::const_reference;
        // iter
        using iterator = typename std::vector<T>::iterator;
        using const_iterator = typename std::vector<T>::const_iterator;
        // reverse iter
        using reverse_iterator = typename std::vector<T>::reverse_iterator;
        using const_reverse_iterator = typename std::vector<T>::const_reverse_iterator;

        // empty construction
        simple() = default;

        // default-insert rows*cols values
        simple(size_type rows, size_type cols)
            : m_rows(rows), m_cols(cols), m_data(rows*cols)
        {}

        // copy initialized matrix rows*cols
        simple(size_type rows, size_type cols, const_reference val)
            : m_rows(rows), m_cols(cols), m_data(rows*cols, val)
        {}

        // 1d-iterators
        iterator begin() { return m_data.begin(); }
        iterator end() { return m_data.end(); }
        const_iterator begin() const { return m_data.begin(); }
        const_iterator end() const { return m_data.end(); }
        const_iterator cbegin() const { return m_data.cbegin(); }
        const_iterator cend() const { return m_data.cend(); }
        reverse_iterator rbegin() { return m_data.rbegin(); }
        reverse_iterator rend() { return m_data.rend(); }
        const_reverse_iterator rbegin() const { return m_data.rbegin(); }
        const_reverse_iterator rend() const { return m_data.rend(); }
        const_reverse_iterator crbegin() const { return m_data.crbegin(); }
        const_reverse_iterator crend() const { return m_data.crend(); }

        // element access (row major indexation)
        reference operator() (size_type const row, size_type const column)
        {
            return m_data[m_cols*row + column];
        }
        const_reference operator() (size_type const row, size_type const column) const
        {
            return m_data[m_cols*row + column];
        }
        reference at (size_type const row, size_type const column)
        {
            return m_data.at(m_cols*row + column);
        }
        const_reference at (size_type const row, size_type const column) const
        {
            return m_data.at(m_cols*row + column);
        }

        // resizing
        void resize(size_type new_rows, size_type new_cols)
        {
            // new matrix new_rows times new_cols
            simple tmp(new_rows, new_cols);
            // select smaller row and col size
            auto mc = std::min(m_cols, new_cols);
            auto mr = std::min(m_rows, new_rows);
            for (size_type i(0U); i < mr; ++i)
            {
                // iterators to begin of rows
                auto row = begin() + i*m_cols;
                auto tmp_row = tmp.begin() + i*new_cols;
                // move mc elements to tmp
                std::move(row, row + mc, tmp_row);
            }
            // move assignment to this
            *this = std::move(tmp);
        }

        // size and capacity
        size_type size() const { return m_data.size(); }
        size_type max_size() const { return m_data.max_size(); }
        bool empty() const { return m_data.empty(); }
        // dimensionality
        size_type rows() const { return m_rows; }
        size_type cols() const { return m_cols; }

        // data swapping
        void swap(simple &rhs)
        {
            using std::swap;
            m_data.swap(rhs.m_data);
            swap(m_rows, rhs.m_rows);
            swap(m_cols, rhs.m_cols);
        }

    private:
        // content
        size_type m_rows{ 0u };
        size_type m_cols{ 0u };
        data_type m_data{};
    };

    template<class T>
    void swap(simple<T> & lhs, simple<T> & rhs)
    {
        lhs.swap(rhs);
    }

    template<class T>
    bool operator== (simple<T> const &a, simple<T> const &b)
    {
        if (a.rows() != b.rows() || a.cols() != b.cols())
        {
            return false;
        }
        return std::equal(a.begin(), a.end(), b.begin(), b.end());
    }

    template<class T>
    bool operator!= (simple<T> const &a, simple<T> const &b)
    {
        return !(a == b);
    }
}
Note several things here:
T needs to fulfill the requirements of the used std::vector member functions
operator() doesn't do any "out of range" checks
No need to manage data on your own
No destructor, copy constructor or assignment operators required
So you don't have to bother with proper memory handling for each application, but only once, for the class you write.
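A short usage sketch (assuming the class compiles as shown above):

#include <iostream>

int main ()
{
    matrices::simple<int> m(3, 4, 0); // 3 rows, 4 columns, all zero
    m(1, 2) = 42;                     // unchecked row-major access
    m.at(1, 2) += 1;                  // checked access, may throw std::out_of_range
    m.resize(4, 4);                   // grows, keeping the old 3x4 contents
    std::cout << m(1, 2) << " (" << m.rows() << "x" << m.cols() << ")\n";
}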
Restrictions
There may be cases where a dynamic "real" two-dimensional structure is favourable. This is for instance the case if
the matrix is very large and sparse (e.g. if some of the rows do not need to be allocated at all but can be represented by a nullptr), or if
the rows do not have the same number of columns (that is, if you don't have a matrix at all but another two-dimensional construct; see the sketch below).
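For illustration, a minimal sketch of such a jagged structure, where vector<vector<T>> is the natural fit because no single stride works:

#include <vector>

int main ()
{
    // a triangular table: row i holds i+1 entries
    std::vector<std::vector<int>> tri;
    for (int i = 0; i < 5; ++i)
        tri.emplace_back(i + 1, 0); // row of length i+1, zero-initialized
}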

Unless you are talking about static arrays, 1D is faster.
Here’s the memory layout of a 1D array (std::vector<T>):
+---+---+---+---+---+---+---+---+---+
| | | | | | | | | |
+---+---+---+---+---+---+---+---+---+
And here’s the same for a dynamic 2D array (std::vector<std::vector<T>>):
+---+---+---+
| * | * | * |
+-|-+-|-+-|-+
| | V
| | +---+---+---+
| | | | | |
| | +---+---+---+
| V
| +---+---+---+
| | | | |
| +---+---+---+
V
+---+---+---+
| | | |
+---+---+---+
Clearly the 2D case loses the cache locality and uses more memory. It also introduces an extra indirection (and thus an extra pointer to follow), but the flat array has the overhead of calculating the indices, so these roughly even out.

1D and 2D Static Arrays
Size: Both will require the same amount of memory.
Speed: You can assume that there will be no speed difference, because the memory for both of these arrays should be contiguous (the whole 2D array should appear as one chunk in memory rather than a bunch of chunks spread across memory). (This could be compiler dependent, however.)
1D and 2D Dynamic Arrays
Size: The 2D array will require a tiny bit more memory than the 1D array because of the pointers needed in the 2D array to point to the set of allocated 1D arrays. (This tiny bit is only tiny when we're talking about really big arrays. For small arrays, the tiny bit could be pretty big relatively speaking.)
Speed: The 1D array may be faster than the 2D array because the memory for the 2D array would not be contiguous, so cache misses would become a problem.
Use what works and seems most logical, and if you face speed problems, then refactor.

The existing answers only compare 1-D arrays against arrays of pointers.
In C (but not C++) there is a third option; you can have a contiguous 2-D array that is dynamically allocated and has runtime dimensions:
int (*p)[num_columns] = malloc(num_rows * sizeof *p);
and this is accessed like p[row_index][col_index].
I would expect this to have very similar performance to the 1-D array case, but it gives you nicer syntax for accessing the cells.
In C++ you can achieve something similar by defining a class which maintains a 1-D array internally, but can expose it via 2-D array access syntax using overloaded operators. Again I would expect that to have similar or identical performance to the plain 1-D array.

Another difference between 1D and 2D arrays appears in memory allocation: we cannot be sure that the members of a 2D array are sequential.

It really depends on how your 2D array is implemented.
Consider the code below:
int a[200], b[10][20], *c[10], *d[10];
for (int ii = 0; ii < 10; ++ii)
{
    c[ii] = &b[ii][0];
    d[ii] = (int*) malloc(20 * sizeof(int)); // The cast is for C++ only.
}
There are 3 implementations here: b, c and d
There won't be a lot of difference between accessing b[x][y] or a[x*20 + y], since in one case you do the computation and in the other the compiler does it for you. c[x][y] and d[x][y] are slower, because the machine has to find the address that c[x] points to and then access the y-th element from there. It is not one straight computation. On some machines (e.g. the AS400, which has 36-byte (not bit) pointers), pointer access is extremely slow. It all depends on the architecture in use. On x86-type architectures, a and b are the same speed, and c and d are slower than b.

I love the thorough answer provided by Pixelchemist. A simpler version of this solution may be as follows. First, declare the dimensions:
constexpr int M = 16; // rows
constexpr int N = 16; // columns
constexpr int P = 16; // planes
Next, create an alias and get and set methods:
#include <vector>

template<typename T>
using Vector = std::vector<T>;

template<typename T>
inline T& set_elem(Vector<T>& m_, size_t i_, size_t j_, size_t k_)
{
    // check indexes here...
    return m_[i_*N*P + j_*P + k_];
}

template<typename T>
inline const T& get_elem(const Vector<T>& m_, size_t i_, size_t j_, size_t k_)
{
    // check indexes here...
    return m_[i_*N*P + j_*P + k_];
}
Finally, a vector may be created and indexed as follows:
Vector<int> array3d(M*N*P, 0); // create 3-d array containing M*N*P zero ints
set_elem(array3d, 0, 0, 1) = 5; // array3d[0][0][1] = 5
auto n = get_elem(array3d, 0, 0, 1); // n = 5
Defining the vector size at initialization provides optimal performance. This solution is modified from this answer. The functions may be overloaded to support varying dimensions with a single vector. The downside of this solution is that the M, N, P parameters are implicitly passed to the get and set functions. This can be resolved by implementing the solution within a class, as done by Pixelchemist.

Related

Overloaded vector operators causing a massive performance reduction?

I am summing and multiplying vectors by a constant many, many times, so I overloaded the operators * and +. However, working with vectors greatly slowed down my program. Working with a standard C-array improved the time by a factor of 40. What would cause such a slowdown?
An example program showing my overloaded operators and exhibiting the slow-down is below. This program does k = k + (0.0001)*q, log(N) times (here N = 1000000). At the end the program prints the times to do the operations using vectors and c-arrays, and also the ratio of the times.
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <time.h>
#include <vector>
using namespace std;

// -------- OVERLOADING VECTOR OPERATORS ---------------------------
vector<double> operator*(const double a, const vector<double> & vec)
{
    vector<double> result;
    for(int i = 0; i < vec.size(); i++)
        result.push_back(a*vec[i]);
    return result;
}

vector<double> operator+(const vector<double> & lhs,
                         const vector<double> & rhs)
{
    vector<double> result;
    for(int i = 0; i < lhs.size(); i++)
        result.push_back(lhs[i]+rhs[i]);
    return result;
}
//------------------------------------------------------------------

//--------------- Basic C-Array operations -------------------------
// s[k] = y[k];
void populate_array(int DIM, double *y, double *s){
    for(int k=0;k<DIM;k++)
        s[k] = y[k];
}
// sums the arrays y and s as y + c*s and sends them to s;
void sum_array(int DIM, double *y, double *s, double c){
    for(int k=0;k<DIM;k++)
        s[k] = y[k] + c*s[k];
}
// sums the arrays y and s as a*y + c*s and sends them to s;
void sum_array2(int DIM, double *y, double *s, double a, double c){
    for(int k=0;k<DIM;k++)
        s[k] = a*y[k] + c*s[k];
}
//------------------------------------------------------------------

int main(){
    vector<double> k = {1e-8,2e-8,3e-8,4e-8};
    vector<double> q = {1e-8,2e-8,3e-8,4e-8};
    double ka[4] = {1e-8,2e-8,3e-8,4e-8};
    double qa[4] = {1e-8,2e-8,3e-8,4e-8};
    int N = 3;
    clock_t begin,end;
    double elapsed_sec,elapsed_sec2;

    begin = clock();
    do
    {
        k = k + 0.0001*q;
        N = 2*N;
    }while(N<1000000);
    end = clock();
    elapsed_sec = double(end-begin) / CLOCKS_PER_SEC;
    printf("vector time: %g \n",elapsed_sec);

    N = 3;
    begin = clock();
    do
    {
        sum_array2(4, qa, ka, 0.0001, 1.0);
        N = 2*N;
    }while(N<1000000);
    end = clock();
    elapsed_sec2 = double(end-begin) / CLOCKS_PER_SEC;
    printf("array time: %g \n",elapsed_sec2);
    printf("time ratio : %g \n", elapsed_sec/elapsed_sec2);
}
I get the ratio of vector time to C-array time to be typically ~40 on my Linux system. What is it about my overloaded operators that causes the slowdown compared to the C-array operations?
Let's take a look at this line:
k = k + 0.0001*q;
To evaluate this, first the computer needs to call your operator*. This function creates a vector and needs to allocate dynamic storage for its elements. Actually, since you use push_back rather than setting the size ahead of time via constructor, resize, or reserve, it might allocate too few elements the first time and need to allocate again to grow the vector.
This created vector (or one move-constructed from it) is then used as a temporary object representing the subexpression 0.0001*q within the whole statement.
Next the computer needs to call your operator+, passing k and that temporary vector. This function also creates and returns a vector, doing at least one dynamic allocation and possibly more. There's a second temporary vector for the subexpression k + 0.0001*q.
Finally, the computer calls an operator= belonging to std::vector. Luckily, there's a move assignment overload, which (probably) just moves the allocated memory from the second temporary to k and deallocates the memory that was in k.
Now that the entire statement has been evaluated, the temporary objects are destroyed. First the temporary for k + 0.0001*q is destroyed, but it no longer has any memory to clean up. Then the temporary for 0.0001*q is destroyed, and it does need to deallocate its memory.
Doing lots of allocating and deallocating of memory, even in small amounts, tends to be somewhat expensive. (The vectors will use std::allocator, which is allowed to be smarter and avoid some allocations and deallocations, but I couldn't say without investigation how likely it would be to actually help here.)
On the other hand, your "C-style" implementation does no allocating or deallocating at all. It does an "in-place" calculation, just modifying arrays passed in to store the values passed out. If you had another C-style implementation with individual functions like double* scalar_times_vec(double s, const double* v, unsigned int len); that used malloc to get memory for the result and required the results to eventually be freed, you would probably get similar results.
So how might the C++ implementation be improved?
As mentioned, you could either reserve the vectors before adding data to them, or give them an initial size and do assignments like v[i] = out; rather than push_back(out);.
The next easiest step would be to use more operators that allow in-place calculations. If you overloaded:
std::vector<double>& operator+=(const std::vector<double>&);
std::vector<double>& operator*=(double);
then you could do:
k += 0.0001*q;
n *= 2;
// or:
n += n;
to do the final computations on k and n in-place. This doesn't easily help with the expression 0.0001*q, though.
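One possible implementation of those two operators, as non-member functions (a sketch without size checking, in the style of the operators from the question; std::vector itself cannot be given new members):

#include <cstddef>
#include <vector>

std::vector<double>& operator+=(std::vector<double>& lhs, const std::vector<double>& rhs)
{
    for (std::size_t i = 0; i < lhs.size(); ++i)
        lhs[i] += rhs[i]; // modify lhs in place, no allocation
    return lhs;
}

std::vector<double>& operator*=(std::vector<double>& vec, double a)
{
    for (std::size_t i = 0; i < vec.size(); ++i)
        vec[i] *= a; // scale in place
    return vec;
}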
Another option that sometimes helps is to overload operators to accept rvalues in order to reuse storage that belonged to temporaries. If we had an overload:
std::vector<double> operator+(const std::vector<double>& a, std::vector<double>&& b);
it would get called for the + in the expression k + 0.0001*q, and the implementation could create the return value from std::move(b), reusing its storage. This gets tricky to get both flexible and correct, though. And it still doesn't eliminate the temporary representing 0.0001*q or its allocation and deallocation.
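A sketch of such an overload (again without size checking):

#include <cstddef>
#include <vector>

std::vector<double> operator+(const std::vector<double>& a, std::vector<double>&& b)
{
    for (std::size_t i = 0; i < b.size(); ++i)
        b[i] += a[i];    // accumulate into the temporary's storage
    return std::move(b); // reuse b's allocation for the result
}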
Another solution that allows in-place calculations in the most general cases is called expression templates. That's rather a lot of work to implement, but if you really need a combination of convenient syntax and efficiency, there are some existing libraries that might be worth looking into.
–––
Edit: I should have taken a closer look at how you perform the C-array operations... See aschepler's answer for why growing the vectors is the least of your problems.
If you have any idea how many elements you are going to add to a vector, you should always call reserve on the vector before adding them. Otherwise you are going to trigger a potentially large number of reallocations, which are costly.
A vector occupies a contiguous block of memory. To grow, it has to allocate a larger block of memory and copy its entire content to the new location. To avoid this happening every time an element is added, the vector usually allocates more memory than is presently needed to store all its elements. The number of elements it can store without reallocation is its capacity. How large this capacity should be is of course a trade-off between avoiding potential future reallocations and wasting memory.
However, if you know (or have a good idea) how many elements will eventually be stored in the vector, you can call reserve(n) to set its capacity to (at least) n and avoid unnecessary reallocations.
Edit:
See also here. push_back has to check the vector's capacity (and possibly reallocate), which makes it a bit slower than just writing to the vector through operator[]. In your case it might be fastest to directly construct a vector of size (not just capacity) n, as doubles are POD and cheap to construct, and then insert the correct values through operator[], as sketched below.
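Applied to the operator* from the question, the two variants might look like this (the helper names are mine, for illustration; either body could replace the original operator*):

#include <cstddef>
#include <vector>

// variant 1: reserve, then push_back -- a single allocation up front
std::vector<double> scale_reserve(double a, const std::vector<double>& vec)
{
    std::vector<double> result;
    result.reserve(vec.size()); // capacity set once, no growth reallocations
    for (std::size_t i = 0; i < vec.size(); ++i)
        result.push_back(a * vec[i]);
    return result;
}

// variant 2: construct at full size, then assign through operator[]
std::vector<double> scale_sized(double a, const std::vector<double>& vec)
{
    std::vector<double> result(vec.size()); // sized (and zero-initialized) up front
    for (std::size_t i = 0; i < vec.size(); ++i)
        result[i] = a * vec[i];
    return result;
}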

use std::vector for dynamically allocated 2d array?

So I am writing a class which has 1D arrays and 2D arrays that I dynamically allocate in the constructor:
class Foo{
    int** array2d;
    int*  array1d;
public:
    Foo(int num1, int num2);
};

Foo::Foo(int num1, int num2){
    array2d = new int*[num1];
    for(int i = 0; i < num1; i++)
    {
        array2d[i] = new int[num2];
    }
    array1d = new int[num1];
}
Then I will have to delete every 1d-array and every array in the 2d array in the destructor, right?
I want to use std::vector to avoid having to do this. Is there any downside to doing this? (Does it make compilation slower, etc.?)
TL;DR: when to use std::vector for dynamically allocated arrays, which do NOT need to be resized during runtime?
vector is fine for the vast majority of uses. Hand-tuned scenarios should first attempt to tune the allocator [1], and only then modify the container. Correctness of memory management (and of your program in general) is worth much, much more than any compilation time gains.
In other words, vector should be your starting point, and until you find it unsatisfactory, you shouldn't care about anything else.
As an additional improvement, consider using a 1-dimensional vector as the backend storage and only providing a 2-dimensional indexed view. This can improve cache locality and overall performance, while also making some operations, like copying the whole structure, much easier.
[1] The second of two template parameters that vector accepts, which defaults to the standard allocator for a given type.
There should not be any drawbacks, since vector guarantees contiguous memory. But if the size is fixed and C++11 is available, maybe std::array is an option, among others:
it doesn't allow resizing
depending on how it is initialized, it prevents reallocations
its size is hardcoded in the instructions (as a template argument). See Ped7g's comment for a more detailed description.
A 2D array is not an array of pointers.
If you define it this way, each row/column can have a different size.
Furthermore, the elements won't be sequential in memory.
This might lead to poor performance, as the prefetcher won't be able to predict your access patterns very well.
Therefore it is not advised to nest std::vectors inside each other to model multi-dimensional arrays.
A better approach is to map a contiguous chunk of memory onto a multi-dimensional space by providing custom access methods.
You can test it in the browser: http://fiddle.jyt.io/github/3389bf64cc6bd7c2218c1c96f62fa203
#include<vector>
template<class T>
struct Matrix {
Matrix(std::size_t n=1, std::size_t m=1)
: n{n}, m{m}, data(n*m)
{}
Matrix(std::size_t n, std::size_t m, std::vector<T> const& data)
: n{n}, m{m}, data{data}
{}
//Matrix M(2,2, {1,1,1,1});
T const& operator()(size_t i, size_t j) const {
return data[i*m + j];
}
T& operator()(size_t i, size_t j) {
return data[i*m + j];
}
size_t n;
size_t m;
std::vector<T> data;
using ScalarType = T;
};
You can implement operator[] by returning a VectorView which has access to the data, an index, and the dimensions (see the sketch below).
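A sketch of what such a view might look like (VectorView is just an assumed name here, not a standard type):

#include <cstddef>

template<class T>
struct VectorView {
    T * row;       // first element of the row
    std::size_t m; // number of columns
    T & operator[](std::size_t j) { return row[j]; }
};

// a possible operator[] inside Matrix<T>:
// VectorView<T> operator[](std::size_t i) { return { &data[i*m], m }; }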

Memory layout : 2D N*M data as pointer to N*M buffer or as array of N pointers to arrays

I'm hesitating over how to organize the memory layout of my 2D data.
Basically, what I want is an N*M 2D double array, where N ~ M are in the thousands (and are derived from user-supplied data)
The way I see it, I have 2 choices :
double *data = new double[N*M];
or
double **data = new double*[N];
for (size_t i = 0; i < N; ++i)
    data[i] = new double[M];
The first choice is what I'm leaning to.
The main advantages I see are shorter new/delete syntax, a contiguous memory layout (which implies adjacent memory accesses at runtime if I arrange my accesses correctly), and possibly better performance for vectorized code (auto-vectorized or using vector libraries such as vDSP or vecLib).
On the other hand, it seems to me that allocating a big chunk of continuous memory could fail/take more time compared to allocating a bunch of smaller ones. And the second method also has the advantage of the shorter syntax data[i][j] compared to data[i*M+j]
What would be the most common / better way to do this, mainly if I view it from a performance standpoint (even though those are going to be small improvements, I'm curious to see which performs better)?
Between the first two choices, for reasonable values of M and N, I would almost certainly go with choice 1. You skip a pointer dereference, and you get nice caching if you access data in the right order.
In terms of your concerns about size, we can do some back-of-the-envelope calculations.
Since M and N are in the thousands, suppose each is 10000 as an upper bound. Then your total memory consumed is
10000 * 10000 * sizeof(double) = 8 * 10^8
This is roughly 800 MB, which while large, is quite reasonable given the size of memory in modern day machines.
If N and M are constants, it is better to just statically declare the memory you need as a two dimensional array. Or, you could use std::array.
std::array<std::array<double, M>, N> data;
If only M is a constant, you could use a std::vector of std::array instead.
std::vector<std::array<double, M>> data(N);
If M is not constant, you need to perform some dynamic allocation. But, std::vector can be used to manage that memory for you, so you can create a simple wrapper around it. The wrapper below returns a row intermediate object to allow the second [] operator to actually compute the offset into the vector.
#include <vector>

template <typename T>
class matrix {
    const size_t N;
    const size_t M;
    std::vector<T> v_;
    struct row {
        matrix &m_;
        const size_t r_;
        row (matrix &m, size_t r) : m_(m), r_(r) {}
        T & operator [] (size_t c) { return m_.v_[r_ * m_.M + c]; }
    };
    struct const_row {
        const matrix &m_;
        const size_t r_;
        const_row (const matrix &m, size_t r) : m_(m), r_(r) {}
        T operator [] (size_t c) const { return m_.v_[r_ * m_.M + c]; }
    };
public:
    matrix (size_t n, size_t m) : N(n), M(m), v_(N*M) {}
    row operator [] (size_t r) { return row(*this, r); }
    const_row operator [] (size_t r) const { return const_row(*this, r); }
};
matrix<double> data(10,20);
data[1][2] = .5;
std::cout << data[1][2] << '\n';
In addressing your particular concern about performance: Your rationale for wanting a single memory access is correct. You should want to avoid doing new and delete yourself, however (which is something this wrapper provides), and if the data is more naturally interpreted as multi-dimensional, then showing that in the code will make the code easier to read as well.
Multiple allocations, as shown in your second technique, are inferior because they will take more time, but their advantage is that they may succeed more often if your memory is fragmented (when the free memory consists of smaller holes and you do not have a free chunk of memory large enough to satisfy the single allocation request). But multiple allocations have another downside, in that some more memory is needed to allocate space for the pointers to each row.
My suggestion provides the single allocation technique without needed to explicitly call new and delete, as the memory is managed by vector. At the same time, it allows the data to be addressed with the 2-dimensional syntax [x][y]. So it provides all the benefits of a single allocation with all the benefits of the multi-allocation, provided you have enough memory to fulfill the allocation request.
Consider using something like the following:
// array of pointers to doubles to point to the beginning of rows
double ** data = new double*[N];
// allocate enough doubles for the whole matrix in the first row
data[0] = new double[N * M];
// distribute pointers to the individual rows
for (size_t i = 1; i < N; i++)
    data[i] = data[0] + i * M;

// ... use data[i][j] ...

// deallocation takes only two deletes
delete[] data[0];
delete[] data;
I'm not sure if this is a general practice or not; I just came up with it. Some downsides still apply to this approach, but I think it eliminates most of them, while keeping the ability to access the individual doubles as data[i][j].

std::vector and contiguous memory of multidimensional arrays

I know that the standard does not force std::vector to allocate contiguous memory blocks, but all implementations obey this nevertheless.
Suppose I wish to create a vector of a multidimensional, static array. Consider 2 dimensions for simplicity, and a vector of length N. That is I wish to create a vector with N elements of, say, int[5].
Can I be certain that all N*5 integers are now contiguous in memory? So that I in principle could access all of the integers simply by knowing the address of the first element? Is this implementation dependent?
For reference the way I currently create a 2D array in a contiguous memory block is by first making a (dynamic) array of float* of length N, allocating all N*5 floats in one array and then copying the address of every 5th element into the first array of float*.
The standard does require the memory of an std::vector to be contiguous. On the other hand, if you write something like:
std::vector<std::vector<double> > v;
the global memory (all of the v[i][j]) will not be contiguous. The usual way of creating 2D arrays is to use a single
std::vector<double> v;
and calculate the indexes, exactly as you suggest doing with float. (You can also create a second std::vector<float*> with the addresses if you want. I've always just recalculated the indexes, however.)
Elements of a vector are guaranteed to be contiguous as per the C++ standard.
Quotes from the standard are as follows:
From n2798 (draft of C++0x):
23.2.6 Class template vector [vector]
1 A vector is a sequence container that supports random access iterators. In addition, it supports (amortized) constant time insert and erase operations at the end; insert and erase in the middle take linear time. Storage management is handled automatically, though hints can be given to improve efficiency. The elements of a vector are stored contiguously, meaning that if v is a vector where T is some type other than bool, then it obeys the identity &v[n] == &v[0] + n for all 0 <= n < v.size().
C++03 standard (23.2.4.1):
The elements of a vector are stored contiguously, meaning that if v is a vector where T is some type other than bool, then it obeys the identity &v[n] == &v[0] + n for all 0 <= n < v.size().
Also, see Herb Sutter's views on the same here.
As @Als already pointed out, yes, std::vector (now) guarantees contiguous allocation. I would not, however, simulate a 2D matrix with an array of pointers. Instead, I'd recommend one of two approaches. The simpler (by far) is to just use operator() for subscripting, and do a multiplication to convert the 2D input to a linear address in your vector:
template <class T>
class matrix2D {
    std::vector<T> data;
    int columns;
public:
    T &operator()(int x, int y) {
        return data[y * columns + x];
    }
    matrix2D(int x, int y) : data(x*y), columns(x) {}
};
If, for whatever reason, you want to use matrix[a][b] style addressing, you can use a proxy class to handle the conversion. Though it was for a 3D matrix instead of 2D, I posted a demonstration of this technique in a previous answer.
For reference the way I currently create a 2D array in a contiguous memory block is by first making a (dynamic) array of float* of length N, allocating all N*5 floats in one array and then copying the address of every 5th element into the first array of float*.
That's not a 2D array, that's an array of pointers. If you want a real 2D array, this is how it's done:
float (*p)[5] = new float[N][5];
p [0] [0] = 42; // access first element
p[N-1][4] = 42; // access last element
delete[] p;
Note there is only a single allocation. May I suggest reading more about using arrays in C++?
Under the hood, a vector may look approximately like this (pseudocode):
class vector<T> {
    T *data;
    size_t s;
};
Now if you make a vector<vector<T> >, there will be a layout like this:
vector<vector<T>> --> data {
    vector<T>,
    vector<T>,
    vector<T>
};
or in "inlined" form:
vector<vector<T>> --> data {
    {data0, s0},
    {data1, s1},
    {data2, s2}
};
Yes, the vector-of-vectors therefore uses contiguous memory, but no, not in the way you'd like. It most probably stores an array of pointers (plus some bookkeeping variables) to external allocations.
The standard only requires that the data of each individual vector is contiguous, not the structure as a whole.
A simple class to create, as you call it, a 2D array, would be something like:
template <class T>
class Array2D {
private:
    T *m_data;
    int m_stride;
public:
    Array2D(int dimY, int dimX) : m_data(new T[dimX * dimY]), m_stride(dimX) {}
    ~Array2D() { delete[] m_data; }
    T* operator[](int row) { return m_data + m_stride * row; }
};
It's possible to use this like:
Array2D<int> myArray(30,20);
for (int i = 0; i < 30; i++)
    for (int j = 0; j < 20; j++)
        myArray[i][j] = i + j;
Or even pass &myArray[0][0] as address to low-level functions that take some sort of "flat buffers".
But as you see, it turns naive expectations around in that access is written myArray[y][x].
Generically, if you interface with code that requires some sort of classical C-style flat array, then why not just use that ?
Edit: As said, the above is simple. There are no bounds-check attempts whatsoever. Just like an array.

Nested STL vector using way too much memory

I have an STL vector My_Partition_Vector of Partition objects, defined as
struct Partition // the event log data structure
{
int key;
std::vector<std::vector<char> > partitions;
float modularity;
};
The actual nested structure of Partition.partitions varies from object to object, but the total number of chars stored in Partition.partitions is always 16.
I therefore assumed that the total size of the object should be more or less 24 bytes (16 + 4 + 4). However, for every 100,000 items I add to My_Partition_Vector, memory consumption (found using ps -aux) increases by around 20 MB, indicating around 209 bytes for each Partition object.
This is nearly a 9-fold increase!? Where is all this extra memory usage coming from? Some kind of padding in the STL vector, or the struct? How can I resolve this (and stop it reaching into swap)?
For one thing, std::vector models a dynamic array, so if you know that you'll always have 16 chars in partitions, using std::vector is overkill. Use a good old C-style array/matrix, boost::array, or boost::multi_array.
To reduce the number of re-allocations needed for inserting/adding elements, due to its memory layout constraints, std::vector is allowed to preallocate memory for a certain number of elements upfront (and its capacity() member function will tell you how much).
While I think he may be overstating the situation just a tad, I'm in general agreement with DeadMG's conclusion that what you're doing is asking for trouble.
Although I'm generally the one looking at (whatever mess somebody has made) and saying "don't do that, just use a vector", this case might well be an exception. You're creating a huge number of objects that should be tiny. Unfortunately, a vector typically looks something like this:
template <class T>
class vector {
    T *data;
    size_t allocated;
    size_t valid;
public:
    // ...
};
On a typical 32-bit machine, that's twelve bytes already. Since you're using a vector<vector<char> >, you're going to have 12 bytes for the outer vector, plus twelve more for each vector it holds. Then, when you actually store any data in your vectors, each of those needs to allocate a block of memory from the free store. Depending on how your free store is implemented, you'll typically have a minimum block size -- frequently 32 or even 64 bytes. Worse, the heap typically has some overhead of its own, so it'll add some more memory onto each block, for its own book-keeping (e.g., it might use a linked list of blocks, adding another pointer worth of data to each allocation).
Just for grins, let's assume you average four vectors of four bytes apiece, and that your heap manager has a 32-byte minimum block size and one extra pointer (or int) for its bookkeeping (giving a real minimum of 36 bytes per block). Multiplying that out, I get 204 bytes apiece -- close enough to your 209 to believe that's reasonably close to what you're dealing with.
The question at that point is how to deal with the problem. One possibility is to try to work behind the scenes. All the containers in the standard library use allocators to get their memory. While the default allocator gets memory directly from the free store, you can substitute a different one if you choose. If you do some looking around, you can find any number of alternative allocators, many/most of which are intended to help with exactly the situation you're in -- reducing wasted memory when allocating lots of small objects. A couple to look at would be the Boost Pool Allocator and the Loki small-object allocator.
Another possibility (that can be combined with the first) would be to quit using a vector<vector<char> > at all, and replace it with something like:
char partitions[16];
struct parts {
    int part0 : 4;
    int part1 : 4;
    int part2 : 4;
    int part3 : 4;
    int part4 : 4;
    int part5 : 4;
    int part6 : 4;
    int part7 : 4;
};
For the moment, I'm assuming a maximum of 8 partitions -- if there could be 16, you can add more to parts. This should probably reduce memory usage quite a bit more, but (as-is) will affect your other code. You could also wrap this up in a small class of its own that provides 2D-style addressing to minimize the impact on the rest of your code (see the sketch below).
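For illustration, a sketch of such a wrapper (my own construction, not from the original answer), assuming each part fits in 4 bits and a fixed 4x4 geometry for the 2D-style accessor:

#include <cstdint>

class packed_parts {
    std::uint8_t bytes[8] = {}; // 16 four-bit parts in 8 bytes
public:
    unsigned get(int i) const {
        return (bytes[i / 2] >> ((i % 2) * 4)) & 0xF;
    }
    void set(int i, unsigned v) {
        int const shift = (i % 2) * 4;
        bytes[i / 2] = static_cast<std::uint8_t>(
            (bytes[i / 2] & ~(0xF << shift)) | ((v & 0xF) << shift));
    }
    // 2d-style addressing, assuming 4 columns per row
    unsigned at(int row, int col) const { return get(row * 4 + col); }
};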
If you store a near-constant number of objects, then I suggest using a 2-dimensional array.
The most likely reason for the memory consumption is debug data. STL implementations usually store A LOT of debug data. Never profile an application with debug flags on.
...This is a bit of a side conversation, but boost::multi_array was suggested as an alternative to the OP's use of nested vectors. My finding was that multi_array was using a similar amount of memory when applied to the OP's operating parameters.
I derived this code from the example at Boost.MultiArray. On my machine, this showed multi_array using about 10x more memory than ideally required assuming that the 16 bytes are arranged in a simple rectangular geometry.
To evaluate the memory usage, I checked the system monitor while the program was running and I compiled with
( export CXXFLAGS="-Wall -DNDEBUG -O3" ; make main && ./main )
Here's the code...
#include <iostream>
#include <vector>
#include "boost/multi_array.hpp"
#include <tr1/array>
#include <cassert>
#define USE_CUSTOM_ARRAY 0 // compare memory usage of my custom array vs. boost::multi_array
using std::cerr;
using std::vector;
#if USE_CUSTOM_ARRAY
template< typename T, int YSIZE, int XSIZE >
class array_2D
{
    std::tr1::array<T,YSIZE*XSIZE> data;
public:
    T & operator () ( int y, int x ) { return data[y*XSIZE+x]; } // preferred accessor (avoid pointers)
    T * operator [] ( int index ) { return &data[index*XSIZE]; } // alternative accessor (mimics boost::multi_array syntax)
};
#endif
int main ()
{
    int COUNT = 1024*1024;

#if USE_CUSTOM_ARRAY
    vector< array_2D<char,4,4> > A( COUNT );
    typedef int index;
#else
    typedef boost::multi_array<char,2> array_type;
    typedef array_type::index index;
    vector<array_type> A( COUNT, array_type(boost::extents[4][4]) );
#endif

    // Assign values to the elements
    int values = 0;
    for ( int n=0; n<COUNT; n++ )
        for(index i = 0; i != 4; ++i)
            for(index j = 0; j != 4; ++j)
                A[n][i][j] = values++;

    // Verify values
    int verify = 0;
    for ( int n=0; n<COUNT; n++ )
        for(index i = 0; i != 4; ++i)
            for(index j = 0; j != 4; ++j)
            {
                assert( A[n][i][j] == (char)((verify++)&0xFF) );
#if USE_CUSTOM_ARRAY
                assert( A[n][i][j] == A[n](i,j) ); // testing accessors
#endif
            }

    cerr <<"spinning...\n";
    while ( 1 ) {} // wait here (so you can check memory usage in the system monitor)

    return 0;
}
On my system, sizeof(vector) is 24. This probably corresponds to three 8-byte members: capacity, size, and pointer. Additionally, you need to consider the actual allocations, which would be between 1 and 16 bytes (plus allocation overhead) for the inner vectors and between 24 and 384 bytes for the outer vector ( sizeof(vector) * partitions.capacity() ).
I wrote a program to sum this up...
for ( int Y=1; Y<=16; Y++ )
{
    const int X = 16/Y;
    if ( X*Y != 16 ) continue; // ignore imperfect geometries
    Partition a;
    a.partitions = vector< vector<char> >( Y, vector<char>(X) );

    int sum = sizeof(a); // main structure
    sum += sizeof(vector<char>) * a.partitions.capacity(); // outer vector
    for ( int i=0; i<(int)a.partitions.size(); i++ )
        sum += sizeof(char) * a.partitions[i].capacity(); // inner vector

    cerr <<"X="<<X<<", Y="<<Y<<", size = "<<sum<<"\n";
}
The results show how much memory (not including allocation overhead) is needed for each simple geometry...
X=16, Y=1, size = 80
X=8, Y=2, size = 104
X=4, Y=4, size = 152
X=2, Y=8, size = 248
X=1, Y=16, size = 440
Look at how the "sum" is calculated to see what all of the components are.
The results posted are based on my 64-bit architecture. If you have a 32-bit architecture the sizes would be almost half as much -- but still a lot more than what you had expected.
In conclusion, std::vector<> is not very space efficient for doing a whole bunch of very small allocations. If your application is required to be efficient, then you should use a different container.
My approach to solving this would probably be to allocate the 16 chars with
std::tr1::array<char,16>
and wrap that with a custom class that maps 2D coordinates onto the array allocation.
Below is a very crude way of doing this, just as an example to get you started. You would have to change this to meet your specific needs -- especially the ability to specify the geometry dynamically.
template< typename T, int YSIZE, int XSIZE >
class array_2D
{
    std::tr1::array<T,YSIZE*XSIZE> data;
public:
    T & operator () ( int y, int x ) { return data[y*XSIZE+x]; } // preferred accessor (avoid pointers)
    T * operator [] ( int index ) { return &data[index*XSIZE]; } // alternative accessor (mimics boost::multi_array syntax)
};
16 bytes is a complete and total waste. You're storing a hell of a lot of data about very small objects. A vector of vectors is the wrong solution to use. You should log sizeof(vector) - it's not insignificant, as it performs a substantial function. On my compiler, sizeof(vector) is 20. So each Partition is 4 + 4 + 16 + 20 + 20 * (number of inner partitions) + memory overheads like the vectors not being the perfect size.
You're only storing 16 bytes of data, and wasting ridiculous amounts of memory allocating them in the most segregated, highest-overhead way you could possibly think of. The vector doesn't use a lot of memory - you have a terrible design.