performance between [fixed array] and [pointer with memory allocation] - c++

I'm writing scientific code that runs the same calculations a huge number of times (10+ hours per run), so speed matters far more than anything else.
case 1
class foo {
public:
    double arr[4] = {0};
    ...
    foo& operator=(foo&& other) {
        std::memcpy(arr, other.arr, sizeof(arr));
        return *this;
    }
    ...
};
case 2
class fee {
public:
    double *arr = nullptr;
    fee() {
        arr = new double[4];
    }
    ~fee() {
        delete[] arr;   // new[] must be released with delete[]; deleting nullptr is a no-op
    }
    ...
    fee& operator=(fee&& other) {
        delete[] arr;   // release the current buffer before stealing the other one
        arr = other.arr;
        other.arr = nullptr;
        return *this;
    }
    ...
};
These classes are used for vector (length 4) and matrix (4x4) calculations.
I heard that fixed-size arrays can be optimized by the compiler.
But in that case, calculations on r-values cannot be optimized, since all elements have to be copied instead of just switching a pointer, as in
A = B*C + D;
So my question is: which is more expensive, allocating and freeing memory, or copying small contiguous blocks?
Or perhaps there is another way to increase the performance (such as writing an expression class)?

First, performance is not really a language question (except for the algorithms used in the standard library) but more an implementation question. Anyway, most common implementations use the program stack for automatic variables and a system heap for dynamic ones (allocated through new).
In that case the performance will depend on the usage. Heap management has a cost, so if you are frequently allocating and deallocating objects, the stack should be the winner. But on the other side, moving heap-allocated data is just a pointer exchange, whereas non-allocated data may need a memcpy.
The total memory size also has a strong impact. Heap memory is limited only by the free system memory (at run time), while the stack size is normally fixed at build time (link phase) and reserved at load time. So if the total size is only known at run time, use dynamic memory.
You are trying to do low-level optimization here, and the rule there is to profile. Build a small program that exercises those structures the way your real code will, and run it under a profiling tool(*) with both implementations. I would also give std::vector a try, since it has nice built-in optimizations.
(*) Beware: simply measuring the time of one single run is not accurate, because it depends on many other parameters such as the load caused by other programs, including system ones.
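For example, a minimal sketch of such a test program might look like the following; it assumes the foo and fee classes from the question are defined above, the iteration count is arbitrary, and an optimiser may discard work whose results are never used, so treat it only as a starting point for a proper profiling run.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <utility>

// Times repeated move assignments for a given class (foo or fee above).
template <class T>
double time_moves(std::size_t iterations)
{
    T a, b;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        a = std::move(b);   // the move assignment being compared
        b = std::move(a);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main()
{
    const std::size_t n = 100000000;
    std::printf("foo (fixed array): %f s\n", time_moves<foo>(n));
    std::printf("fee (heap array) : %f s\n", time_moves<fee>(n));
}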

Related

How fast is memory access to a vector compared to a normal array?

I have a function that is called thousands of times a second (it's an audio effect), and I need a buffer for writing and reading audio data. Is there a considerable difference in performance between declaring the float buffer as a plain array or as a vector?
Once declared, my array is not resized during the audio loop, but at the initialization phase I don't know the exact length because it's dependent on the audio sampling rate. So, for example, if I need a 2-second audio buffer for a sampling rate of 44100 Hz, I normally do this:
declaration:
// wrapped in a hypothetical enclosing class so the question's fragments compile
class AudioBuffer {
    int size = 0;
    float *buffer = nullptr;

public:
    void init(int sr)
    {
        size = sr * 2;
        buffer = new float[size]();   // value-initialised to zero
    }

    ~AudioBuffer()
    {
        delete[] buffer;
    }
};
Allocating memory dynamically has a small cost, as does the later deallocation, but since you're already using new, your costs are equivalent in every way to a vector given an adequate initial size or reserve call.
Once allocated, the operations can be expected to be just as fast in any optimised build, but you should profile it yourself if you have any reason to care.
It's not relevant to your new-ing code, but FYI there is at least a potential difference due to addressing: a global or static array might have its virtual address known at compile time, and a stack-based array might be at a known offset from the stack pointer, but on most architectures there's no appreciable performance difference between those and indexing relative to a runtime-determined pointer.
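For comparison, here is a minimal sketch of the same buffer written with std::vector; the wrapper class name is made up, and the sizing mirrors the init() shown in the question.
#include <vector>

// Hypothetical vector-based equivalent of the question's buffer.
class AudioBufferVec {
    std::vector<float> buffer;

public:
    void init(int sr)
    {
        buffer.assign(sr * 2, 0.0f);  // allocate once, zero-filled; no manual delete needed
    }

    // Indexing a vector costs the same as indexing a raw pointer in optimised builds.
    float& operator[](int i) { return buffer[i]; }
};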

2D array access time comparison

I have two ways of constructing a 2D array:
int arr[NUM_ROWS][NUM_COLS];
//...
tmp = arr[i][j];
and a flattened array
int arr[NUM_ROWS*NUM_COLS];
//...
tmp = arr[i*NUM_COLS+j];
I am doing image processing, so even a small improvement in access time matters. Which one is faster? I am thinking the first one, since the second one needs a calculation, but then the first one requires two addressing steps, so I am not sure.
I don't think there is any performance difference. The system will allocate the same amount of contiguous memory in both cases. The i*NUM_COLS + j calculation is either done explicitly by you in the 1D case, or done by the compiler in the 2D case. The only real concern is ease of use.
You should trust your compiler's ability to optimize standard code.
You should also trust modern CPUs to have fast integer multiplication instructions.
Don't worry about using one or the other!
Decades ago I sped up some code considerably by using pointers instead of the 2D-array index calculation, but this will a) only be useful if you can keep the pointer around, e.g. across a loop, and b) have low impact, since I guess modern CPUs should do 2D array access in a single cycle? Worth measuring! It may also depend on the array size.
In any case, pointers advanced with ptr++ or ptr += NUM_COLS will for sure be a little bit faster where applicable!
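As a sketch of that pointer-walking style, compare the two loops below; the dimensions and function names are made up for illustration.
#include <cstddef>

constexpr std::size_t NUM_ROWS = 480;   // example dimensions, not from the question
constexpr std::size_t NUM_COLS = 640;

// Indexed version: the compiler normally turns i * NUM_COLS + j into the same
// address arithmetic it would emit for a true 2D array.
long long sum_indexed(const int* arr)
{
    long long sum = 0;
    for (std::size_t i = 0; i < NUM_ROWS; ++i)
        for (std::size_t j = 0; j < NUM_COLS; ++j)
            sum += arr[i * NUM_COLS + j];
    return sum;
}

// Pointer-walking version: keep one pointer and advance it, so no per-element
// multiplication appears in the source (the optimiser often does this anyway).
long long sum_pointer(const int* arr)
{
    long long sum = 0;
    const int* p = arr;
    for (std::size_t n = 0; n < NUM_ROWS * NUM_COLS; ++n)
        sum += *p++;
    return sum;
}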
The first method will almost always be faster. IN GENERAL (because there are always corner cases), the processor and memory architecture, as well as the compiler, may have optimizations built in to help with 2D arrays or other similar data structures. For example, GPUs are optimized for matrix (2D array) math.
So, again in general, I would allow the compiler and hardware to optimize your memory and address arithmetic if possible.
...also, I agree with @Paul R: there are much bigger considerations when it comes to performance than your array allocation and address arithmetic.
There are two cases to consider: compile-time definition and run-time definition of the array size. There is a big difference in performance.
Static allocation, global or file scope, fixed size array:
The compiler knows the size of the array and tells the linker to allocate space in the data / memory section. This is the fastest method.
Example:
#define ROWS 5
#define COLUMNS 6
int array[ROWS][COLUMNS];
int buffer[ROWS * COLUMNS];
Run time allocation, function local scope, fixed size array:
The compiler knows the size of the array, and tells the code to allocate space in the local memory (a.k.a. stack) for the array. In general, this means adding a value to a stack register. Usually one or two instructions.
Example:
void my_function(void)
{
unsigned short my_array[ROWS][COLUMNS];
unsigned short buffer[ROWS * COLUMNS];
}
Run Time allocation, dynamic memory, fixed size array:
Again, the compiler has already calculated the amount of memory required for the array, since it was declared with a fixed size. The compiler emits code to call the memory allocation function with the required amount (usually passed as a parameter). This is a little slower because of the function call and the overhead required to find some free dynamic memory (and possibly the allocator's internal bookkeeping).
Example:
void another_function(void)
{
unsigned char * array = new unsigned char [ROWS * COLUMNS];
//...
delete[] array;
}
Run Time allocation, dynamic memory, variable size:
Regardless of the dimensions of the array, the compiler must emit code to calculate the amount of memory to allocate. This quantity is then passed to the memory allocation function. A little slower than above because of the code required to calculate the size.
Example:
int * create_board(unsigned int rows, unsigned int columns)
{
int * board = new int [rows * columns];
return board;
}
Since your goal is image processing, I would assume your images are too large for static arrays. The question you should really be asking is about dynamically allocated arrays.
In C/C++ there are multiple ways to allocate a dynamic 2D array (see How do I work with dynamic multi-dimensional arrays in C?). To make this work in both C and C++ we can use malloc with a cast (in C++ only, you can use new).
Method 1:
int** arr1 = (int**)malloc(NUM_ROWS * sizeof(int*));
for (int i = 0; i < NUM_ROWS; i++)
    arr1[i] = (int*)malloc(NUM_COLS * sizeof(int));
Method 2:
int** arr2 = (int**)malloc(NUM_ROWS * sizeof(int*));
int* arrflat = (int*)malloc(NUM_ROWS * NUM_COLS * sizeof(int));
for (int i = 0; i < NUM_ROWS; i++)
    arr2[i] = arrflat + (i * NUM_COLS);
Method 2 essentially creates a contiguous 2D array: i.e. arrflat[NUM_COLS*i+j] and arr2[i][j] should have identical performance. However, arrflat[NUM_COLS*i+j] and arr1[i][j] from Method 1 should not be expected to have identical performance, since arr1 is not contiguous. Method 1, however, seems to be the method most commonly used for dynamic arrays.
In general, I use arrflat[NUM_COLS*i+j] so I don't have to think about how to allocate dynamic 2D arrays.
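Neither method above shows the cleanup, so here is a sketch of how each layout would typically be freed; the function names are made up for illustration.
#include <cstdlib>

// Method 1: every row was a separate malloc, so free each row and then the row table.
void free_method1(int** arr1, int num_rows)
{
    for (int i = 0; i < num_rows; i++)
        free(arr1[i]);
    free(arr1);
}

// Method 2: only two allocations were made, so only two frees are needed.
void free_method2(int** arr2, int* arrflat)
{
    free(arrflat);  // the single contiguous block of elements
    free(arr2);     // the table of row pointers
}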

What's the meaning of __pagesize, __subpagesize, __malloc_header_size in basic_string's _S_create function

The _S_create function in the STL's basic_string uses these three variables to calculate the actual allocated size. Does anyone know why?
The interface of _S_create is
_Rep* _S_create(size_type __capacity, size_type __old_capacity, const _Alloc& __alloc)
Why does it need the __old_capacity parameter?
PS: It is the GNU STL.
The actual code explains fairly well how it uses __pagesize and __malloc_header_size, if you just bother to read the comments (and it's rather shameful NOT to read the comments; after all, someone probably spent more time writing those comments than I did writing this answer):
// The standard places no restriction on allocating more memory
// than is strictly needed within this layer at the moment or as
// requested by an explicit application call to reserve().
// Many malloc implementations perform quite poorly when an
// application attempts to allocate memory in a stepwise fashion
// growing each allocation size by only 1 char. Additionally,
// it makes little sense to allocate less linear memory than the
// natural blocking size of the malloc implementation.
// Unfortunately, we would need a somewhat low-level calculation
// with tuned parameters to get this perfect for any particular
// malloc implementation. Fortunately, generalizations about
// common features seen among implementations seems to suffice.
// __pagesize need not match the actual VM page size for good
// results in practice, thus we pick a common value on the low
// side. __malloc_header_size is an estimate of the amount of
// overhead per memory allocation (in practice seen N * sizeof
// (void*) where N is 0, 2 or 4). According to folklore,
// picking this value on the high side is better than
// low-balling it (especially when this algorithm is used with
// malloc implementations that allocate memory blocks rounded up
// to a size which is a power of 2).
And Sebastian got it right (although it's written in a slightly different way):
// The below implements an exponential growth policy, necessary to
// meet amortized linear time requirements of the library: see
// http://gcc.gnu.org/ml/libstdc++/2001-07/msg00085.html.
// It's active for allocations requiring an amount of memory above
// system pagesize. This is consistent with the requirements of the
// standard: http://gcc.gnu.org/ml/libstdc++/2001-07/msg00130.html
if (__capacity > __old_capacity && __capacity < 2 * __old_capacity)
__capacity = 2 * __old_capacity;
The __pagesize is used here to round it up a bit...
const size_type __adj_size = __size + __malloc_header_size;
if (__adj_size > __pagesize && __capacity > __old_capacity)
{
const size_type __extra = __pagesize - __adj_size % __pagesize;
__capacity += __extra / sizeof(_CharT);
// Never allocate a string bigger than _S_max_size.
if (__capacity > _S_max_size)
__capacity = _S_max_size;
__size = (__capacity + 1) * sizeof(_CharT) + sizeof(_Rep);
}
The implementation I have doesn't have a __subpagesize, but I expect that's a similar rounding thing.
Without looking at libstdc++ or asking anyone with definite information, one possible implementation is
size_type __actual_capacity = max(__capacity, 2 * __old_capacity);
// do other stuff to create a _Rep.
This ensures geometric growth while also guaranteeing at least as much memory as is definitely needed.
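Putting the two quoted fragments together, a self-contained sketch of that capacity policy could look like the code below; the page size and header size are illustrative constants rather than the values libstdc++ actually uses, and the element type is assumed to be char-sized.
#include <cstddef>

constexpr std::size_t kPageSize         = 4096;              // assumed page size
constexpr std::size_t kMallocHeaderSize = 4 * sizeof(void*); // assumed allocator overhead

// Sketch of the growth policy described above.
std::size_t grow_capacity(std::size_t requested, std::size_t old_capacity)
{
    // Exponential growth: never grow by less than a factor of two.
    if (requested > old_capacity && requested < 2 * old_capacity)
        requested = 2 * old_capacity;

    // For large requests, round the underlying allocation up to whole pages,
    // leaving room for the allocator's per-block header.
    const std::size_t adj_size = requested + kMallocHeaderSize;
    if (adj_size > kPageSize && requested > old_capacity)
        requested += kPageSize - adj_size % kPageSize;

    return requested;
}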

c++ dynamically allocated variables, what is the flow of execution?

I have few questions:
1) Why, when I create more than two dynamically allocated variables, is the difference between their memory addresses 16 bytes? (I thought one of the advantages of dynamic variables is saving memory, because deleting an unused variable frees that memory.) But if the gap between two dynamic variables is 16 bytes even for a short integer, then there is a lot of memory I never get to use.
2) Creating a dynamically allocated variable using the new operator:
int x;
cin >> x;
int* a = new int(3);
int y = 4;
int z = 1;
In the example above, what is the flow of execution of this program? Will it store all the variables x, a, y and z on the stack, and then store the value 3 at the address that a points to?
3) Creating a dynamically allocated array:
int x;
cin >> x;
int* array = new int[x];
int y = 4;
int z = 1;
And the same question here.
4) Does the size of the heap (free store) depend on how much memory I am using in the code area, the stack area, and the global area?
1) Storing small values like integers on the heap is fairly pointless, because you use the same or more memory to store the pointer. The 16-byte alignment is just so the CPU can access the memory as efficiently as possible.
2) Yes, although the stack variables might be allocated to registers; that is up to the compiler.
3) Same as 2.
4) The size of the heap is controlled by the operating system and expanded as necessary as you allocate more memory.
Yes, in the examples, a and array are both "stack" variables. The data they point to is not.
I put stack in quotes because we are not going to concern ourselves with hardware detail here, but just the semantics. They have the semantics of stack variables.
The chunks of heap memory that you allocate need to carry some housekeeping data so that the allocator (the code working behind new) can do its job. This data usually includes the chunk length and the address of the next allocated chunk, among other things, depending on the actual allocator.
In your case, this bookkeeping data is stored directly in front of (and maybe also behind) the actual allocated chunk. This, plus most likely alignment, is the reason for the 16-byte gap you observe.
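You can see this for yourself by printing the addresses returned by a few consecutive allocations; here is a small sketch (the exact gap depends on your allocator and platform, so 16 bytes is typical on a 64-bit system but not guaranteed).
#include <cstdio>

int main()
{
    // The spacing between these addresses reflects allocator headers and
    // alignment, not the size of the stored type.
    short* a = new short(1);
    short* b = new short(2);
    short* c = new short(3);

    std::printf("a = %p\nb = %p\nc = %p\n",
                static_cast<void*>(a), static_cast<void*>(b), static_cast<void*>(c));

    delete a;
    delete b;
    delete c;
}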

memory usage of a class - converting double to float didn't reduce memory usage as expected

I am initializing millions of objects of the following class type
template<class T>
struct node
{
//some functions
private:
T m_data_1;
T m_data_2;
T m_data_3;
node* m_parent_1;
node* m_parent_2;
node* m_child;
};
The purpose of the template is to enable the user to choose float or double precision, the idea being that node<float> will occupy less memory (RAM).
However, when I switch from double to float the memory footprint of my program does not decrease as I expect it to. I have two questions:
Is it possible that the compiler/operating system is reserving more space than required for my floats (or even storing them as doubles)? If so, how do I stop this happening? I'm using Linux on a 64-bit machine with g++.
Is there a tool that lets me determine the amount of memory used by all the different classes (i.e. some sort of memory profiling), to make sure that the memory isn't being gobbled up somewhere else that I haven't thought of?
If you are compiling for 64-bit, then each pointer will be 64-bits in size. This also means that they may need to be aligned to 64-bits. So if you store 3 floats, it may have to insert 4 bytes of padding. So instead of saving 12 bytes, you only save 8. The padding will still be there whether the pointers are at the beginning of the struct or the end. This is necessary in order to put consecutive structs in arrays to continue to maintain alignment.
Also, your structure is primarily composed of 3 pointers. The 8 bytes you save take you from a 48-byte object to a 40 byte object. That's not exactly a massive decrease. Again, if you're compiling for 64-bit.
If you're compiling for 32-bit, then you're saving 12 bytes from a 36-byte structure, which is better percentage-wise. Potentially more if doubles have to be aligned to 8 bytes.
The other answers are correct about the source of the discrepancy. However, pointers (and other types) on x86/x86-64 are not required to be aligned. It is just that performance is better when they are, which is why GCC keeps them aligned by default.
But GCC provides a "packed" attribute to let you exert control over this:
#include <iostream>
template<class T>
struct node
{
private:
T m_data_1;
T m_data_2;
T m_data_3;
node* m_parent_1;
node* m_parent_2;
node* m_child;
} ;
template<class T>
struct node2
{
private:
T m_data_1;
T m_data_2;
T m_data_3;
node2* m_parent_1;
node2* m_parent_2;
node2* m_child;
} __attribute__((packed));
int
main(int argc, char *argv[])
{
std::cout << "sizeof(node<double>) == " << sizeof(node<double>) << std::endl;
std::cout << "sizeof(node<float>) == " << sizeof(node<float>) << std::endl;
std::cout << "sizeof(node2<float>) == " << sizeof(node2<float>) << std::endl;
return 0;
}
On my system (x86-64, g++ 4.5.2), this program outputs:
sizeof(node<double>) == 48
sizeof(node<float>) == 40
sizeof(node2<float>) == 36
Of course, the "attribute" mechanism and the "packed" attribute itself are GCC-specific.
In addition to the valid points that Nicol makes:
When you call new/malloc, it doesn't necessarily correspond one-to-one with a call to the OS to allocate memory. To reduce the number of expensive system calls, the heap manager may allocate more than is requested and then "suballocate" chunks of that on later new/malloc calls. Also, memory can typically only be requested from the OS 4 KB at a time (the minimum page size). Essentially, there may be chunks of memory allocated that are not currently in active use, in order to speed up future allocations.
To answer your questions directly:
1) Yes, the runtime will very likely allocate more memory than you asked for; this memory is not wasted, it will be reused for future new/malloc calls, but it will still show up in "task manager" or whatever tool you use. No, it will not promote floats to doubles. The more allocations you make, the less likely this edge condition is to be the cause of the size difference, and the items in Nicol's answer will dominate. For a smaller number of allocations, this item is likely to dominate (where "large" and "small" depend entirely on your OS and kernel).
2) The Windows task manager will give you the total memory allocated. Something like WinDbg will actually give you the virtual memory range chunks (usually organised in a tree) that were allocated by the runtime. For Linux, I expect this data to be available in one of the files in the /proc directory associated with your process.
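For the Linux case, here is a small sketch that reads the kernel's per-process figures from /proc/self/status; VmSize and VmRSS are the fields current Linux kernels report, and this is meant as an illustration rather than a portable API.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Print the virtual and resident memory lines the kernel reports for this process.
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmSize", 0) == 0 || line.rfind("VmRSS", 0) == 0)
            std::cout << line << '\n';
    }
}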