I'm having an issue with (specifically the MSFT VS 10.0 implementation of) std::unique_ptrs. When I create a std::list of them, I use twice as much memory as when I create a std::list of just the underlying object (note: this is a big object -- ~200 bytes, so it's not just an extra reference counter lying around).
In other words, if I run:
std::list<MyObj> X;
X.resize( 1000, MyObj());
my application will require half as much memory as when I run:
std::list<std::unique_ptr<MyObj>> X;
for ( int i=0; i<1000; i++ ) X.push_back(std::unique_ptr<MyObj>(new MyObj()));
I've checked out the MSFT implementation and I don't see anything obvious -- any one encountered this and have any ideas?
EDIT: Ok, to be a bit more clear/specific. This is clearly a Windows memory usage issue and I am obviously missing something. I have now tried the following:
Create a std::list of 100000 MyObj
Create a std::list of 100000 MyObj*
Create a std::list of 100000 int*
Create a std::list of 50000 int*
In each case, each add'l member of the list, whether a pointer or otherwise, is bloating my application by 4400(!) bytes. This is in a release, 64-bit build, without any debugging information included (Linker > Debugging > Generate Debug Info set to No).
I obviously need to research this a bit more to narrow it down to a smaller test case.
For those interested, I am determining application size using Process Explorer.
Turns out it was entirely heap fragmentation. How ridiculous. 4400 bytes per 8 byte object! I switched to pre-allocating and the problem went away entirely -- I am used to some inefficiency in relying on per-object allocation, but this was just ridiculous.
MyObj implementation below:
class MyObj
{
public:
MyObj() { memset(this,0,sizeof(MyObj)); }
double m_1;
double m_2;
double m_3;
double m_4;
double m_5;
double m_6;
double m_7;
double m_8;
double m_9;
double m_10;
double m_11;
double m_12;
double m_13;
double m_14;
double m_15;
double m_16;
double m_17;
double m_18;
double m_19;
double m_20;
double m_21;
double m_22;
double m_23;
CUnit* m_UnitPtr;
CUnitPos* m_UnitPosPtr;
};
The added memory is likely from heap inefficiencies - you have to pay extra for each block you allocate due to internal fragmentation and malloc data. You're performing twice the amount of allocations which is going to incur a penalty hit.
For instance, this:
for(int i = 0; i < 100; ++i) {
new int;
}
will use more memory than this:
new int[100];
Even though the amount allocated is the same.
Edit:
I'm getting around 13% more memory used using unique_ptr using GCC on Linux.
std::list<MyObj> contains N copies of your object (+ the information needed for the pointers of the list).
std::unique_ptr<MyObj> contains a pointer to a instance of your object. (It should only contain a MyObj*).
So a std::list<std::unique_ptr<MyObj>> is not directly equivalent to your first list. std::list<MyObj*> should give the same size as the std::unque_ptr list.
After verifying the implementation, the only thing that could be embedded next to the pointer of the object itself, could be the 'deleter', which in the default case is a empty object that calls operator delete.
Do you have a Debug or a Release build?
This isn't an answer, but it doesn't fit in a comment and it might be illustrative.
I cannot reproduce the claim (GCC 4.6.2). Take this code:
#include <memory>
#include <list>
struct Foo { char p[200]; };
int main()
{
//std::list<Foo> l1(100);
std::list<std::unique_ptr<Foo>> l2;
for (unsigned int i = 0; i != 100; ++i) l2.emplace_back(new Foo);
}
Enabling only l1 produces (in Valgrind):
total heap usage: 100 allocs, 100 frees, 20,800 bytes allocated
Enabling only l2 and the loop gives:
total heap usage: 200 allocs, 200 frees, 21,200 bytes allocated
The smart pointers take up exactly 4 × 100 bytes.
In both cases, /usr/bin/time -v gives:
Maximum resident set size (kbytes): 3136
Further more, pmap shows in both cases: total 2996K. To confirm, I changed the object size to 20000 and the number of elements to 10000. Now the numbers are 198404K vs 198484K: Exactly 80000B difference, 8B per unique pointer (presumably there's some 8B alignment going on in the allocator of the list). Under the same changes, the "maximum resident set size"s reported by time -v are now 162768 vs 164304.
Related
I'm making scientific code that calculates very many times(10+ hours), so the speed is far more important than any other thing.
case 1
class foo{
public:
double arr[4] = {0};
...
foo& operator = (foo&& other){
std::memcpy(arr, other.arr, sizeof(arr));
}
...
}
case 2
class fee{
public:
double *arr = nullptr;
fee(){
arr = new double[4];
}
~fee(){
if(arr != nullptr)
free[] arr;
}
...
&fee operator = (fee&& other){
arr = other.arr;
other.arr = nullptr;
}
...
}
These classes are used for vector(length 4) and matrix(size 4x4) calculations.
I heard that arrays of fixed size can be optimized by the compiler.
But in that case, r-value calculations can not be optimized(since all elements have to be copied instead of pointer switching).
A = B*C + D;
So my question is what is more expensive, memory allocation and freeing or copying close memories?
Or perhaps there is another way to increase the performance(such as making an expression class)?
First performance is not really a language question (except for algorythms used in the standard library) but more an implementation question. Anyway most common implementations use the program stack for automatic variables and a system heap for dynamic ones (allocated through new).
In that case the performance will depend on the usage. Heap management has a cost. So if you are frequently allocating and deallocating them, stack management should be the winner. But on the other side moving allocated data is just a matter of pointer exchange when you may need a memcpy for non allocated one.
The total memory has also a strong impact. Heap memory is only limited by the free system memory (at run time), while the stack size is normally defined at build time (link phase) and statically allocated at load time. So if the total size is only known at run time, use dynamic memory.
You are here trying to do low level optimization. The rule is then to do profiling. Build a small program making the expected usage of those structures, and use a profiling tool(*) with both implementations. I would alse give a try to a standard vector which has nice built-in optimizations.
(*) Beware, simply measuring the time of one single run is not accurate because it depends of many other parameters such as the load caused by other programs including system ones.
I have two ways of constructing a 2D array:
int arr[NUM_ROWS][NUM_COLS];
//...
tmp = arr[i][j]
and flattened array
int arr[NUM_ROWS*NUM_COLS];
//...
tmp = arr[i*NuM_COLS+j];
I am doing image processing so even a little improvement in access time is necessary. Which one is faster? I am thinking the first one since the second one needs calculation, but then the first one requires two addressing so I am not sure.
I don't think there is any performance difference. System will allocate same amount of contiguous memory in both cases. For calculate i*Numcols+j, either you would do it for 1D array declaration, or system would do it in 2D case. Only concern is ease of usage.
You should have trust into the capabilities of your compiler in optimizing standard code.
Also you should have trust into modern CPUs having fast numeric multiplication instructions.
Don't bother to use one or another!
I - decades ago - optimized some code greatly by using pointers instead of using 2d-array-calculation --> but this will a) only be useful if it is an option to store the pointer - e.g. in a loop and b) have low impact since i guess modern cpus should do 2d array access in a single cycle? Worth measuring! May be related to the array size.
In any case pointers using ptr++ or ptr += NuM_COLS will for sure be a little bit faster if applicable!
The first method will almost always be faster. IN GENERAL (because there are always corner cases) processor and memory architecture as well as compilers may have optimizations built in to aid with 2d arrays or other similar data structures. For example, GPUs are optimized for matrix (2d array) math.
So, again in general, I would allow the compiler and hardware to optimize your memory and address arithmetic if possible.
...also I agree with #Paul R, there are much bigger considerations when it comes to performance than your array allocation and address arithmetic.
There are two cases to consider: compile time definition and run-time definition of the array size. There is big difference in performance.
Static allocation, global or file scope, fixed size array:
The compiler knows the size of the array and tells the linker to allocate space in the data / memory section. This is the fastest method.
Example:
#define ROWS 5
#define COLUMNS 6
int array[ROWS][COLUMNS];
int buffer[ROWS * COLUMNS];
Run time allocation, function local scope, fixed size array:
The compiler knows the size of the array, and tells the code to allocate space in the local memory (a.k.a. stack) for the array. In general, this means adding a value to a stack register. Usually one or two instructions.
Example:
void my_function(void)
{
unsigned short my_array[ROWS][COLUMNS];
unsigned short buffer[ROWS * COLUMNS];
}
Run Time allocation, dynamic memory, fixed size array:
Again, the compiler has already calculated the amount of memory required for the array since it was declared with fixed size. The compiler emits code to call the memory allocation function with the required amount (usually passed as a parameter). A little slower because of the function call and the overhead required to find some dynamic memory (and maybe garbage collection).
Example:
void another_function(void)
{
unsigned char * array = new char [ROWS * COLS];
//...
delete[] array;
}
Run Time allocation, dynamic memory, variable size:
Regardless of the dimensions of the array, the compiler must emit code to calculate the amount of memory to allocate. This quantity is then passed to the memory allocation function. A little slower than above because of the code required to calculate the size.
Example:
int * create_board(unsigned int rows, unsigned int columns)
{
int * board = new int [rows * cols];
return board;
}
Since your goal is image processing then I would assume your images are too large for static arrays. The correct question you should be about dynamically allocated arrays
In C/C++ there are multiple ways you can allocate a dynamic 2D array How do I work with dynamic multi-dimensional arrays in C?. To make this work in both C/C++ we can use malloc with casting (for C++ only you can use new)
Method 1:
int** arr1 = (int**)malloc(NUM_ROWS * sizeof(int*));
for(int i=0; i<NUM_ROWS; i++)
arr[i] = (int*)malloc(NUM_COLS * sizeof(int));
Method 2:
int** arr2 = (int**)malloc(NUM_ROWS * sizeof(int*));
int* arrflat = (int*)malloc(NUM_ROWS * NUM_COLS * sizeof(int));
for (int i = 0; i < dimension1_max; i++)
arr2[i] = arrflat + (i*NUM_COLS);
Method 2 essentially creates a contiguous 2D array: i.e. arrflat[NUM_COLS*i+j] and arr2[i][j] should have identical performance. However, arrflat[NUM_COLS*i+j] and arr[i][j] from method 1 should not be expected to have identical performance since arr1 is not contiguous. Method 1, however, seems to be the method that is most commonly used for dynamic arrays.
In general, I use arrflat[NUM_COLS*i+j] so I don't have to think of how to allocated dynamic 2D arrays.
I am receiving the error "User store segfault # 0x000000007feff598" for a large convolution operation.
I have defined the resultant array as
int t3_isize = 0;
int t3_irowcount = 0;
t3_irowcount=atoi(argv[2]);
t3_isize = atoi(argv[3]);
int iarray_size = t3_isize*t3_irowcount;
uint64_t t_result[iarray_size];
I noticed that if the array size is less than 2^16 - 1, the operation doesn't fail, but for the array size 2^16 or higher, I get the segfault error.
Any idea why this is happening? And how can i rectify this?
“I noticed that if the array size is greater than 2^16 - 1, the operation doesn't fail, but for the array size 2^16 or higher, I get the segfault error”
↑ Seems a bit self-contradictory.
But probably you're just allocating a too large array on the stack. Using dynamic memory allocation (e.g., just switch to using std::vector) you avoid that problem. For example:
std::vector<uint64_t> t_result(iarray_size);
In passing, I would ditch the Hungarian notation-like prefixes. For example, t_ reads like this is a type. The time for Hungarian notation was late 1980's, and its purpose was to support Microsoft's Programmer's Workbench, a now dicontinued (for very long) product.
You're probably declaring too large of an array for the stack. 216 elements of 8 bytes each is quite a lot (512K bytes).
If you just need static allocation, move the array to file scope.
Otherwise, consider using std::vector, which will allocate storage from the heap and manage it for you.
Using malloc() solved the issue.
uint64_t* t_result = (uint64_t*) malloc(sizeof(uint64_t)*iarray_size);
I have few questions:
1) why when I created more than two dynamic allocated variables the difference between their memory address is 16 bytes. (I thought one of the advantages of using dynamic variables is saving memory, so when you delete unused variable it will free that memory); but if the difference between two dynamic variables is 16 bytes even using a short integer, then there a lot of memery that I will not benifit .
2) creating a dynamic allocated variable using new operator.
int x;
cin >> x;
int* a = new int(3);
int y = 4;
int z = 1;
In the e.g above. what is the flow of execution of this program. is it gonna store all variable likes x,a,y and z in the stack and then will store the value 3 in the address that a points to?
3) creating a dynamic alloated array.
int x;
cin >> x;
int* array = new int[x];
int y = 4;
int z = 1;
and the same question here.
4) does the size of the heap(free scope) depend on how much of memory im using in the code area,the stack are, and the global area ?
Storing small values like integers on the heap is fairly pointless because you use the same or more memory to store the pointer. The 16 byte alignment is just so the CPU can access the memory as efficiently as possible.
Yes, although the stack variables might be allocated to registers; that is up to the compiler.
Same as 2.
The size of the heap is controlled by the operating system and expanded as necessary as you allocate more memory.
Yes, in the examples, a and array are both "stack" variables. The data they point to is not.
I put stack in quotes because we are not going to concern ourselves with hardware detail here, but just the semantics. They have the semantics of stack variables.
The chunks of heap memory which you allocate need to store some housekeeping data so that the allocator (the code which works in behind of new) could work. The data usually includes chunk length and the address of next allocated chunk, among other things — depending on the actual allocator.
In your case, the service data are stored directly in front of (and, maybe, behind of, too) the actual allocated chunk. This (plus, likely, alignment) is the reason of 16 byte gap you observe.
I am initializing millions of classes that are of the following type
template<class T>
struct node
{
//some functions
private:
T m_data_1;
T m_data_2;
T m_data_3;
node* m_parent_1;
node* m_parent_2;
node* m_child;
}
The purpose of the template is to enable the user to choose float or double precision, with the idea being that by node<float> will occupy less memory (RAM).
However, when I switch from double to float the memory footprint of my program does not decrease as I expect it to. I have two questions,
Is it possible that the compiler/operating system is reserving more space than required for my floats (or even storing them as a double). If so, how do I stop this happening - I'm using linux on 64 bit machine with g++.
Is there a tool that lets me determine the amount of memory used by all the different classes? (i.e. some sort of memory profiling) - to make sure that the memory isn't being goobled up somewhere else that I haven't thought of.
If you are compiling for 64-bit, then each pointer will be 64-bits in size. This also means that they may need to be aligned to 64-bits. So if you store 3 floats, it may have to insert 4 bytes of padding. So instead of saving 12 bytes, you only save 8. The padding will still be there whether the pointers are at the beginning of the struct or the end. This is necessary in order to put consecutive structs in arrays to continue to maintain alignment.
Also, your structure is primarily composed of 3 pointers. The 8 bytes you save take you from a 48-byte object to a 40 byte object. That's not exactly a massive decrease. Again, if you're compiling for 64-bit.
If you're compiling for 32-bit, then you're saving 12 bytes from a 36-byte structure, which is better percentage-wise. Potentially more if doubles have to be aligned to 8 bytes.
The other answers are correct about the source of the discrepancy. However, pointers (and other types) on x86/x86-64 are not required to be aligned. It is just that performance is better when they are, which is why GCC keeps them aligned by default.
But GCC provides a "packed" attribute to let you exert control over this:
#include <iostream>
template<class T>
struct node
{
private:
T m_data_1;
T m_data_2;
T m_data_3;
node* m_parent_1;
node* m_parent_2;
node* m_child;
} ;
template<class T>
struct node2
{
private:
T m_data_1;
T m_data_2;
T m_data_3;
node2* m_parent_1;
node2* m_parent_2;
node2* m_child;
} __attribute__((packed));
int
main(int argc, char *argv[])
{
std::cout << "sizeof(node<double>) == " << sizeof(node<double>) << std::endl;
std::cout << "sizeof(node<float>) == " << sizeof(node<float>) << std::endl;
std::cout << "sizeof(node2<float>) == " << sizeof(node2<float>) << std::endl;
return 0;
}
On my system (x86-64, g++ 4.5.2), this program outputs:
sizeof(node<double>) == 48
sizeof(node<float>) == 40
sizeof(node2<float>) == 36
Of course, the "attribute" mechanism and the "packed" attribute itself are GCC-specific.
In addtion to the valid points that Nicol makes:
When you call new/malloc, it doesn't necessarily correspond 1 to 1 with a call the the OS to allocate memory. This is because in order to reduce the number of expensive syste, calls, the heap manager may allocate more than is requested, and then "suballocate" chunks of that when you call new/malloc. Also, memory can only be allocated 4kb at a time (typically - this is the minimum page size). Essentially, there may be chunks of memory allocated that are not currently actively used, in order to speed up future allocations.
To answer your questions directly:
1) Yes, the runtime will very likely allocate more memory then you asked for - but this memory is not wasted, it will be used for future news/mallocs, but will still show up in "task manager" or whatever tool you use. No, it will not promote floats to doubles. The more allocations you make, the less likely this edge condition will be the cause of the size difference, and the items in Nicol's will dominate. For a smaller number of allocations, this item is likely to dominate (where "large" and "small" depends entirely on your OS and Kernel).
2) The windows task manager will give you the total memory allocated. Something like WinDbg will actually give you the virtual memory range chunks (usually allocated in a tree) that were allocated by the run-time. For Linux, I expect this data will be available in one of the files in the /proc directory associated with your process.