I am setting up my Cortex-M4 platform to use heap memory and encountering some issues.
I set heap region size to be 512 bytes, and it only allocates 9 bytes. Then I set heap to be 10kB and it can only allocate 362 bytes.
Here is my gcc stub:
int _sbrk(int a)
{
//align a to 4 bytes
if (a & 3)
{
a += (4 - (a & 0x3));
}
extern long __heap_start__;
extern long __heap_end__;
static char* heap_ptr = (char*)&__heap_start__;
if (heap_ptr + a < (char*)&__heap_end__)
{
int res = (int)heap_ptr;
heap_ptr += a;
return res;
}
else
{
return -1;
}
}
__heap_start__ and __heap_end__ are correct and their difference shows the correct region size.
I added debug in _sbrk function to see what a argument is passed when this function is called and the values of that argument are like these in each call respectively:
2552
1708
4096
What can I do to make it use the full heap memory? And how is the _sbrk argument calculated? Basically, what's wrong here?
Building C++ code, using new (std::nothrow).
EDIT
If I am using malloc (C style) it allocates 524 bytes and no _sbrk call before main, unlike when using operator new.
arm-none-eabi-g++.exe (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
EDIT2 Minimal Complete Verifiable Example
Here is my application code and _sbrk with info printing:
void foo()
{
while (true)
{
uint8_t * byte = new (std::nothrow) uint8_t;
if (byte)
{
DBB("Byte allocated");
cnt++;
}
else
{
DBB_ERROR("Allocated %d bytes", cnt);
}
}
}
int _sbrk(int a)
{
//align a to 4 bytes
if (a & 3)
{
a += (4 - (a & 0x3));
}
extern long __heap_start__;
extern long __heap_end__;
static char* heap_ptr = (char*)&__heap_start__;
DBB("%d 0x%08X", a, a);
DBB("0x%08X", heap_ptr);
DBB("0x%08X", &__heap_start__);
DBB("0x%08X", &__heap_end__);
if (heap_ptr + a < (char*)&__heap_end__)
{
int res = (int)heap_ptr;
heap_ptr += a;
DBB("OK 0x%08X 0x%08X", res, heap_ptr);
return res;
}
else
{
DBB("ERROR");
return -1;
}
}
And produced output is:
Your output reveals the C++ memory allocation system first asks for 32 bytes and then 132 bytes. It is then able to satisfy nine requests for new uint8_t with that space. Presumably it uses some of the 164 bytes for its internal record-keeping. This may involve keeping linked lists or maps of which blocks are allocated, or some other data structure. Also, for efficiency, it likely does not track single-byte allocations but rather provides some minimum block size for each allocation, perhaps 8 or 16 bytes. When it runs out of the space it needs, it asks for another 4096 bytes. Your sbrk then fails since this is not available.
The C++ memory allocation system is working as designed. In order to operate, it requires more space than is doled out for individual requests. In order to supply more memory for requests, you must provide more memory in the heap. You cannot expect a one-to-one correspondence, or any simple correspondence, between memory supplied from sbrk to the memory allocation system and memory supplied from the memory allocation system to its clients.
There is no way to tell the C++ memory allocation system to use “full heap memory” to satisfy requests to it. It is required to track dynamic allocations and releases of memory. Since its clients may make diverse size requests and may release them in any order, it needs to be able to track which blocks are currently allocated and which are not—a simple stack will not suffice. Therefore, it must use additional data structures to keep track of memory, and those data structures will consume space. So not all of the heap space can be given to clients; some of it must be used for overhead.
If the memory use of the memory allocation system in your C++ implementation is too inefficient for your purposes, you might replace it with one you write yourself or with third-party software. Any implementation of the memory allocation system makes various trade-offs about speed and block size, and those can be tailored to particular situations and goals.
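As one illustration of such a trade-off, a fixed-size block pool avoids per-allocation bookkeeping by threading the free list through the unused blocks themselves. The following is only a minimal sketch under stated assumptions: it is not what newlib or libstdc++ does, the names BlockPool, allocate and release are made up for this example, and it can only serve requests of a single size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size block pool: every block has the same size, and a
// free block stores the pointer to the next free block inside itself, so no
// separate bookkeeping memory is needed.
template <std::size_t BlockSize, std::size_t BlockCount>
class BlockPool {
    static_assert(BlockSize >= sizeof(void*), "a free block must hold a pointer");
    static_assert(BlockSize % alignof(void*) == 0, "blocks must stay pointer-aligned");

public:
    BlockPool() {
        // Chain all blocks into the initial free list.
        for (std::size_t i = 0; i < BlockCount; ++i) {
            release(storage_ + i * BlockSize);
        }
    }

    // Returns a block, or nullptr when the pool is exhausted.
    void* allocate() {
        if (free_list_ == nullptr) return nullptr;
        void* block = free_list_;
        free_list_ = *static_cast<void**>(block);  // pop the head
        return block;
    }

    // Returns a block previously obtained from allocate() to the pool.
    void release(void* block) {
        assert(block != nullptr);
        *static_cast<void**>(block) = free_list_;  // push onto the head
        free_list_ = block;
    }

private:
    alignas(std::max_align_t) unsigned char storage_[BlockSize * BlockCount];
    void* free_list_ = nullptr;
};
```

With BlockPool<8, 64>, for example, 512 bytes of storage yield exactly 64 blocks. The price is that only one block size is supported and nothing detects a block being released twice.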
Related
I'm writing scientific code that runs for a very long time (10+ hours), so speed is far more important than anything else.
case 1
class foo{
public:
double arr[4] = {0};
...
foo& operator = (foo&& other){
std::memcpy(arr, other.arr, sizeof(arr));
return *this;
}
...
}
case 2
class fee{
public:
double *arr = nullptr;
fee(){
arr = new double[4];
}
~fee(){
if(arr != nullptr)
delete[] arr;
}
...
fee& operator = (fee&& other){
arr = other.arr;
other.arr = nullptr;
return *this;
}
...
}
These classes are used for vector (length 4) and matrix (size 4x4) calculations.
I heard that arrays of fixed size can be optimized by the compiler.
But in that case, calculations involving r-values cannot be optimized (since all elements have to be copied instead of just switching a pointer).
A = B*C + D;
So my question is: which is more expensive, allocating and freeing memory, or copying small, nearby blocks of memory?
Or is there perhaps another way to increase performance (such as writing an expression class)?
First, performance is not really a language question (except for the algorithms used in the standard library) but more an implementation question. Anyway, most common implementations use the program stack for automatic variables and a system heap for dynamic ones (allocated through new).
In that case the performance will depend on the usage. Heap management has a cost. So if you are frequently allocating and deallocating them, stack management should be the winner. But on the other side moving allocated data is just a matter of pointer exchange when you may need a memcpy for non allocated one.
The total memory has also a strong impact. Heap memory is only limited by the free system memory (at run time), while the stack size is normally defined at build time (link phase) and statically allocated at load time. So if the total size is only known at run time, use dynamic memory.
You are trying to do low-level optimization here. The rule is then to profile. Build a small program that makes the expected use of those structures, and run a profiling tool(*) on both implementations. I would also give a standard vector a try, since it has nice built-in optimizations.
(*) Beware: simply measuring the time of a single run is not accurate, because it depends on many other parameters, such as the load caused by other programs, including system ones.
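A small timing harness along those lines might look like the sketch below. It is an assumption-laden stand-in, not the asker's classes: InlineVec, HeapVec, churn and time_ms are hypothetical names, and the workload (create a temporary, move it into a result) is just one possible usage pattern. A real profiler and repeated runs are still needed before drawing conclusions.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for "case 1": storage lives inside the object.
struct InlineVec {
    double arr[4] = {0, 0, 0, 0};
};

// Hypothetical stand-in for "case 2": storage lives on the heap.
struct HeapVec {
    double* arr;
    HeapVec() : arr(new double[4]()) {}
    ~HeapVec() { delete[] arr; }
    HeapVec(const HeapVec&) = delete;
    HeapVec& operator=(const HeapVec&) = delete;
    HeapVec(HeapVec&& other) noexcept : arr(other.arr) { other.arr = nullptr; }
    HeapVec& operator=(HeapVec&& other) noexcept {
        if (this != &other) {
            delete[] arr;          // drop our old buffer
            arr = other.arr;       // steal the other buffer
            other.arr = nullptr;
        }
        return *this;
    }
};

// Creates a temporary, fills it, and moves it into `result`, `iterations`
// times. Returns a checksum so the work has an observable result.
template <class Vec>
double churn(std::size_t iterations) {
    double checksum = 0.0;
    Vec result;
    for (std::size_t i = 0; i < iterations; ++i) {
        Vec tmp;
        tmp.arr[0] = static_cast<double>(i);
        result = std::move(tmp);
        checksum += result.arr[0];
    }
    return checksum;
}

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <class Work>
double time_ms(Work work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_ms with a lambda that runs churn<InlineVec>(n) and then churn<HeapVec>(n), several times each, gives a first impression. Which variant wins depends on the compiler, the allocator, and the real access pattern.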
I have code written to create a linked list of dynamically created objects:
#include <iostream>
using namespace std;
struct X {
int i;
X* x;
};
void birth(X* head, int quant){
X* x = head;
for(int i=0;i<quant-1;i++){
x->i = i+1;
x->x = new X;
x = x->x;
}
x->i = quant;
x->x = 0;
}
void kill(X* x){
X* next;
while(1==1){
cout << x->i << endl;
cout << (long)x << endl;
next = x->x;
delete x;
if(next == 0){
break;
} else {
x = next;
}
}
}
int main(){
cout << (long)sizeof(X) << endl;
X* x = new X;
birth(x, 10);
kill(x);
return 0;
}
Which seems to be working, except for the fact that when you look at the addresses of each of the objects...
16
1
38768656
2
38768688
3
38768720
4
38768752
5
38768784
6
38768816
7
38768848
8
38768880
9
38768912
10
38768944
They seem to be created 32 bytes apart despite the size of X being only 16 bytes. Is there an issue with how I am creating the objects, or is this just a consequence of how dynamic allocation works?
The reason is stated in 7.22.3 Memory management functions of the C Standard:
The order and contiguity of storage allocated by successive calls to
the aligned_alloc, calloc, malloc, and realloc functions is
unspecified. The pointer returned if the allocation succeeds is
suitably aligned so that it may be assigned to a pointer to any type
of object with a fundamental alignment requirement and then used to
access such an object or an array of such objects in the space
allocated
Since the memory must be "suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement", memory returned by malloc et al tends to start on distinct, platform-dependent multiples - usually 8- or 16-byte boundaries.
And because new is usually implemented with malloc, this applies to C++ new also.
Addresses of allocated memory blocks are controlled by the heap manager. Only the heap manager's interface is defined (new/delete, malloc/free), not its implementation. The application has to accept the provided addresses and work with them.
In other words, it is theoretically possible to implement a heap manager that allocates memory blocks at random-like addresses. The application, however, has to work equally well also in this case.
The new operator does not guarantee contiguous allocation. Here is a more convincing example:
#include <iostream>
int main()
{
for (int i = 0 ; i < 32 ; ++i)
std::cout << std::hex << new int() << std::endl;
}
Output on a 64-bit CPU:
0x22cac20
0x22cac40
0x22cac60
0x22cac80
...
0x22cafe0
0x22cb000
You are working in an environment with 8 bytes of allocation overhead and minimum dynamic memory alignment of 16 bytes. So each 16 byte allocation has 8 bytes of allocation overhead and 8 bytes of alignment padding.
If you try again with a 24 byte object (making sure sizeof really is 24 not 32) you will find only 8 bytes of overhead and not an additional 8 bytes of alignment padding.
There is a minimum size (including overhead) of 32 bytes. So if you try with a tiny object, you get a total of 32, not 16. If you try with a 40 byte object, you get a total of 48 demonstrating the lack of 32 byte alignment.
That is all specific to the environment in which you are running. The C++ standard allows for a much wider range of possible behavior.
The 8 bytes immediately preceding the 16-byte aligned chunk returned by the allocator must hold the size of the allocation plus at least one status bit indicating whether the previous chunk is free. That is the minimum overhead a 64-bit allocator needs and while the chunk is in use it is all the overhead needed. But once a chunk is free, there is significant overhead at the beginning of the chunk to support consolidating adjacent free chunks and to support quickly finding a good size free chunk for new allocations. That overhead wouldn't fit if the total were just 16 bytes.
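To observe these numbers on another platform, one can allocate two blocks of a given size and look at the distance between their addresses. This is only a rough sketch: spacing_for and is_fundamentally_aligned are hypothetical helper names, and nothing guarantees that consecutive allocations are adjacent, so the distance is a hint rather than a measurement.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocates two blocks of `size` bytes back to back and returns the distance
// between their addresses. The blocks need not be adjacent, so the value is
// only a hint about overhead and padding on this particular allocator.
std::ptrdiff_t spacing_for(std::size_t size) {
    void* first = std::malloc(size);
    void* second = std::malloc(size);
    std::ptrdiff_t distance = 0;
    if (first && second) {
        distance = static_cast<std::ptrdiff_t>(
            reinterpret_cast<std::uintptr_t>(second) -
            reinterpret_cast<std::uintptr_t>(first));
    }
    std::free(second);
    std::free(first);
    return distance;
}

// True when `p` is a multiple of the fundamental alignment of this platform.
bool is_fundamentally_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0;
}
```

In the environment described above, spacing_for(1), spacing_for(16) and spacing_for(24) would be expected to report 32 and spacing_for(40) to report 48. Other allocators may give different results, including negative or unrelated distances.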
I am now reading the source code of OpenCV, an open-source computer vision library. I am confused by this function:
#define CV_MALLOC_ALIGN 16
void* fastMalloc( size_t size )
{
uchar* udata = (uchar*)malloc(size + sizeof(void*) + CV_MALLOC_ALIGN);
if(!udata)
return OutOfMemoryError(size);
uchar** adata = alignPtr((uchar**)udata + 1, CV_MALLOC_ALIGN);
adata[-1] = udata;
return adata;
}
/*!
Aligns pointer by the certain number of bytes
This small inline function aligns the pointer by the certian number of bytes by
shifting it forward by 0 or a positive offset.
*/
template<typename _Tp> static inline _Tp* alignPtr(_Tp* ptr, int n=(int)sizeof(_Tp))
{
return (_Tp*)(((size_t)ptr + n-1) & -n);
}
fastMalloc is used to allocate memory for a pointer; it invokes malloc and then alignPtr. I do not understand why alignPtr is called after the memory is allocated. My basic understanding is that doing so makes it much faster for the machine to find the pointer. Are there references on this topic on the internet? On modern computers, is it still necessary to perform this operation? Any ideas will be appreciated.
Some platforms require certain types of data to appear on certain byte boundaries (e.g., some compilers require pointers to be stored on 4-byte boundaries).
This is called alignment, and it calls for extra padding within, and possibly at the end of, the object's data.
Code may fail if the data is not properly aligned, or there may be a performance penalty when reading that data (since two blocks may have to be read to fetch a single value).
EDITED IN RESPONSE TO COMMENT:-
A program's memory requests are generally handled by a memory allocator. One kind of memory allocator is the fixed-size allocator, which returns chunks of a specified size even if the requested amount is smaller than that size. With that background, let me try to explain what's going on here:
uchar* udata = (uchar*)malloc(size + sizeof(void*) + CV_MALLOC_ALIGN);
This allocates more memory than was requested: size bytes plus sizeof(void*) + CV_MALLOC_ALIGN extra bytes. The extra bytes leave room to store the original pointer and to move forward to an aligned address inside the block.
uchar** adata = alignPtr((uchar**)udata + 1, CV_MALLOC_ALIGN);
This is trying to align pointer to specific boundary as explained above.
It allocates a block a bit bigger than it was asked for.
Then it sets adata to the next properly aligned address (skip forward by the size of one pointer, then round up to the next properly aligned address).
Then it stores the original pointer before the new address. I assume this is later used to free the originally allocated block.
And then we return the new address.
This only makes sense if CV_MALLOC_ALIGN is a stricter alignment than malloc guarantees - perhaps a cache line?
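The steps described above can be sketched as a pair of functions. This is an illustrative re-implementation under stated assumptions, not OpenCV's code: aligned_malloc and aligned_free are hypothetical names, the alignment must be a power of two no smaller than the alignment of a pointer, and overflow of the size computation is not checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Over-allocates, rounds the address up to `alignment`, and stores the
// pointer that malloc really returned just in front of the aligned address.
void* aligned_malloc(std::size_t size, std::size_t alignment) {
    assert(alignment >= alignof(void*) && (alignment & (alignment - 1)) == 0);
    // Room for the payload, one stored pointer, and worst-case alignment slack.
    void* raw = std::malloc(size + sizeof(void*) + alignment);
    if (raw == nullptr) return nullptr;
    // Skip past the slot for the stored pointer, then round up.
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;
    void** user = reinterpret_cast<void**>(aligned);
    user[-1] = raw;  // remember the original pointer for aligned_free
    return user;
}

// Frees a block obtained from aligned_malloc by recovering the stored pointer.
void aligned_free(void* ptr) {
    if (ptr == nullptr) return;
    std::free(static_cast<void**>(ptr)[-1]);
}
```

A block from aligned_malloc must be released with aligned_free, never with free directly, because the address handed to the caller is not the one malloc returned. Where available, std::aligned_alloc (C++17) or posix_memalign serve the same purpose without the manual bookkeeping.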
Inspired by this question.
Apparently in the following code:
#include <Windows.h>
int _tmain(int argc, _TCHAR* argv[])
{
if( GetTickCount() > 1 ) {
char buffer[500 * 1024];
SecureZeroMemory( buffer, sizeof( buffer ) );
} else {
char buffer[700 * 1024];
SecureZeroMemory( buffer, sizeof( buffer ) );
}
return 0;
}
compiled with default stack size (1 megabyte) with Visual C++ 10 with optimizations on (/O2) a stack overflow occurs because the program tries to allocate 1200 kilobytes on stack.
The code above is of course slightly exaggerated to show the problem - uses lots of stack in a rather dumb way. Yet in real scenarios stack size can be smaller (like 256 kilobytes) and there could be more branches with smaller objects that would induce a total allocation size enough to overflow the stack.
That makes no sense. The worst case would be 700 kilobytes - it would be the codepath that constructs the set of local variables with the largest total size along the way. Detecting that path during compilation should not be a problem.
So the compiler produces a program that tries to allocate even more memory than the worst case. According to this answer LLVM does the same.
That could be a deficiency in the compiler, or there could be some real reason for doing it this way. I mean, maybe I just don't understand something in compiler design that would explain why doing allocation this way is necessary.
Why would the compiler want a program to allocate more memory than the code needs in the worst case?
I can only speculate that this optimization was deemed too unimportant by the compiler designers. Or perhaps, there is some subtle security reason.
BTW, on Windows, the stack is reserved in its entirety when the thread starts execution, but is committed on an as-needed basis, so you are not really spending much "real" memory even if you reserved a large stack.
Reserving a large stack can be a problem on a 32-bit system, where having a large number of threads can eat the available address space without really committing much memory. On 64-bit, you are golden.
It could be down to your use of SecureZeroMemory. Try replacing it with regular ZeroMemory and see what happens - the MSDN page essentially indicates that SZM has some additional semantics beyond what its signature implies, and they could be the cause of the bug.
The following code when compiled using GCC 4.5.1 on ideone places the two arrays at the same address:
#include <iostream>
int main()
{
int x;
std::cin >> x;
if (x % 2 == 0)
{
char buffer[500 * 1024];
std::cout << static_cast<void*>(buffer) << std::endl;
}
if (x % 3 == 0)
{
char buffer[700 * 1024];
std::cout << static_cast<void*>(buffer) << std::endl;
}
}
input: 6
output:
0xbf8e9b1c
0xbf8e9b1c
The answer is probably "use another compiler" if you want this optimization.
OS paging and byte alignment could be a factor. Also, housekeeping may use extra stack, along with the space required for calling other functions within that function.
I am initializing millions of objects of the following type
template<class T>
struct node
{
//some functions
private:
T m_data_1;
T m_data_2;
T m_data_3;
node* m_parent_1;
node* m_parent_2;
node* m_child;
};
The purpose of the template is to enable the user to choose float or double precision, with the idea being that node<float> will occupy less memory (RAM).
However, when I switch from double to float the memory footprint of my program does not decrease as I expect it to. I have two questions:
1) Is it possible that the compiler/operating system is reserving more space than required for my floats (or even storing them as doubles)? If so, how do I stop this happening? I'm using Linux on a 64-bit machine with g++.
2) Is there a tool that lets me determine the amount of memory used by all the different classes (i.e. some sort of memory profiling), to make sure that the memory isn't being gobbled up somewhere else that I haven't thought of?
If you are compiling for 64-bit, then each pointer will be 64 bits in size. This also means that they may need to be aligned to 64 bits. So if you store 3 floats, the compiler may have to insert 4 bytes of padding. So instead of saving 12 bytes, you only save 8. The padding will still be there whether the pointers are at the beginning of the struct or the end. This is necessary so that consecutive structs in an array continue to maintain alignment.
Also, your structure is primarily composed of 3 pointers. The 8 bytes you save take you from a 48-byte object to a 40 byte object. That's not exactly a massive decrease. Again, if you're compiling for 64-bit.
If you're compiling for 32-bit, then you're saving 12 bytes from a 36-byte structure, which is better percentage-wise. Potentially more if doubles have to be aligned to 8 bytes.
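The arithmetic above can be checked with a small calculation that applies the usual layout rule by hand and compares the result with sizeof. This is a sketch assuming C++14 (for loops in constexpr functions) and a conventional ABI; round_up, expected_node_size and NodeLayout are hypothetical names, and NodeLayout only mirrors the member order of the node template in the question.

```cpp
#include <cstddef>

// Rounds `value` up to the next multiple of `alignment` (a power of two).
constexpr std::size_t round_up(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// Size of a struct holding three T values followed by three pointers, using
// the usual rule: each member starts at a multiple of its alignment, and the
// total is rounded up to the strictest member alignment.
template <class T>
constexpr std::size_t expected_node_size() {
    std::size_t offset = 0;
    for (int i = 0; i < 3; ++i) offset = round_up(offset, alignof(T)) + sizeof(T);
    for (int i = 0; i < 3; ++i) offset = round_up(offset, alignof(void*)) + sizeof(void*);
    std::size_t strictest = alignof(T) > alignof(void*) ? alignof(T) : alignof(void*);
    return round_up(offset, strictest);
}

// Same member order as the `node` template in the question (hypothetical copy).
template <class T>
struct NodeLayout {
    T data_1, data_2, data_3;
    NodeLayout* parent_1;
    NodeLayout* parent_2;
    NodeLayout* child;
};
```

On a typical 64-bit target, expected_node_size<float>() works out to 12 bytes of floats, 4 bytes of padding and 24 bytes of pointers, while expected_node_size<double>() is 24 plus 24 with no padding. Comparing both against sizeof(NodeLayout<T>) shows whether the platform follows this rule.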
The other answers are correct about the source of the discrepancy. However, pointers (and other types) on x86/x86-64 are not required to be aligned. It is just that performance is better when they are, which is why GCC keeps them aligned by default.
But GCC provides a "packed" attribute to let you exert control over this:
#include <iostream>
template<class T>
struct node
{
private:
T m_data_1;
T m_data_2;
T m_data_3;
node* m_parent_1;
node* m_parent_2;
node* m_child;
} ;
template<class T>
struct node2
{
private:
T m_data_1;
T m_data_2;
T m_data_3;
node2* m_parent_1;
node2* m_parent_2;
node2* m_child;
} __attribute__((packed));
int
main(int argc, char *argv[])
{
std::cout << "sizeof(node<double>) == " << sizeof(node<double>) << std::endl;
std::cout << "sizeof(node<float>) == " << sizeof(node<float>) << std::endl;
std::cout << "sizeof(node2<float>) == " << sizeof(node2<float>) << std::endl;
return 0;
}
On my system (x86-64, g++ 4.5.2), this program outputs:
sizeof(node<double>) == 48
sizeof(node<float>) == 40
sizeof(node2<float>) == 36
Of course, the "attribute" mechanism and the "packed" attribute itself are GCC-specific.
In addition to the valid points that Nicol makes:
When you call new/malloc, it doesn't necessarily correspond 1 to 1 with a call to the OS to allocate memory. This is because, in order to reduce the number of expensive system calls, the heap manager may allocate more than is requested, and then "suballocate" chunks of that when you call new/malloc. Also, memory can only be obtained from the OS in whole pages (typically 4 KB, the minimum page size). Essentially, there may be chunks of memory allocated that are not currently actively used, in order to speed up future allocations.
To answer your questions directly:
1) Yes, the runtime will very likely allocate more memory than you asked for - but this memory is not wasted; it will be used for future news/mallocs, though it will still show up in "task manager" or whatever tool you use. No, it will not promote floats to doubles. The more allocations you make, the less likely this edge condition will be the cause of the size difference, and the items in Nicol's answer will dominate. For a smaller number of allocations, this item is likely to dominate (where "large" and "small" depend entirely on your OS and kernel).
2) The Windows Task Manager will give you the total memory allocated. Something like WinDbg will actually give you the virtual memory range chunks (usually allocated in a tree) that were allocated by the run-time. For Linux, I expect this data will be available in one of the files in the /proc directory associated with your process.
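On Linux, one such file is /proc/self/status, which contains lines such as "VmRSS" (resident memory) and "VmSize" (virtual memory) in kilobytes. The sketch below reads one of them from inside the process; proc_status_kb is a hypothetical helper name, and it simply reports -1 when the file or the key is not available, for instance on a non-Linux system.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Reads a "<key>: <number> kB" line, for example "VmRSS", from
// /proc/self/status. Returns the number of kilobytes, or -1 if the file or
// the key is unavailable.
long proc_status_kb(const std::string& key) {
    std::ifstream status("/proc/self/status");
    std::string line;
    const std::string prefix = key + ":";
    while (std::getline(status, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) {
            std::istringstream fields(line.substr(prefix.size()));
            long kilobytes = 0;
            if (fields >> kilobytes) return kilobytes;
            return -1;
        }
    }
    return -1;
}
```

Calling proc_status_kb("VmRSS") before and after building the nodes gives a rough per-run figure. It measures the whole process, not individual classes, so a dedicated heap profiler is still the better tool for a per-type breakdown.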