I am trying to understand a program that uses multi-threading with shared memory. The parent thread calls the following function, and I don't quite understand how it works.
#define MAX_STACK_SIZE 16384 // 16KB of stack
/*!
* Writes to a 16 KB buffer on the stack. If we are using 4K pages for our
* stack, this will make sure that we won't have a page fault when the stack
* grows. Also mlock's all pages associated with the current process, which
* prevents the program from being swapped out. If we do run out of
* memory, the robot program will be killed by the OOM process killer (and
* leaves a log) instead of just becoming unresponsive.
*/
void HardwareBridge::prefaultStack() {
  printf("[Init] Prefault stack...\n");
  volatile char stack[MAX_STACK_SIZE];
  memset(const_cast<char*>(stack), 0, MAX_STACK_SIZE);
  if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1) {
    initError(
        "mlockall failed. This is likely because you didn't run robot as "
        "root.\n",
        true);
  }
}
// Parent thread
void HardwareBridge::run() {
  printf("[HardwareBridge] Init stack\n");
  prefaultStack();
  //printf("[HardwareBridge] Init scheduler\n"); // Commented because unrelated to current question
  //setupScheduler();
  // Calls multiple threads here
  for (;;) {
    usleep(10000000);
  }
}
Can someone explain the purpose of this function? Based on the comment, I understand that it prevents the stack size from growing beyond 16 KB. However, the shared memory in the program is predominantly allocated dynamically using the new keyword. Doesn't dynamic memory allocation take place on the heap rather than the stack? How does the function help in this scenario?
Based on the comment, I understand that it prevents the stack size from growing beyond 16 KB.
That's not what the comment says and not what the function does.
Can someone explain the purpose of this function?
The comment explains it. The function does two things:
It pre-allocates 16K of stack.
It "locks" the allocated memory which prevents it from being swapped to disk.
These two things guarantee that there won't be a page fault when the stack usage grows (as long as it doesn't grow beyond 16K).
However, the shared memory is predominantly allocated dynamically
True. That means shared memory, like any other dynamic allocation, is irrelevant to this function; it deals only with the stack and with page locking.
In prefaultStack() you make sure the stack has grown by at least 16K beyond its previous size. Then, by calling mlockall(), you lock the current memory pages into RAM, preventing them from being swapped out. After that you exit the function.
I would say that the only real effect of this is that you ensure that no matter what, the calling thread will have at least 16K of stack available even if later on some hungry process eats up all remaining memory.
As per my understanding, the replies are inline below.
Based on the comment, I understand that it prevents the stack size from growing beyond 16 KB.
Nope. The prefaultStack function has a char[MAX_STACK_SIZE] array on its stack, so the stack segment of the main process will be at least 4 pages (16 KB) plus the stack used by the main function. And no virtual memory pages of this process (including any allocated in the future as the stack or heap grows) will be swapped out to the swap area, because mlockall is called with MCL_CURRENT and MCL_FUTURE (https://linux.die.net/man/2/mlockall). That is the only functionality of this function; nothing related to dynamic memory, the heap, or shared memory.
However, the shared memory is predominantly allocated dynamically using the new keyword in the program.
You are dealing with multi-threading, so the dynamic memory (heap) address space is already shared among the threads. This code does nothing with respect to shared memory between two processes.
Doesn't dynamic memory allocation take place on the heap rather than the stack? How does the function help in this scenario?
Dynamic memory is allocated from the heap, and this function does not deal with the heap in any way. This code only makes sure that all the stack and heap pages of the process (both currently allocated and allocated in the future, per the MCL_CURRENT and MCL_FUTURE flags) are never swapped out to the swap area, preventing page faults, which are time-consuming, costly operations.
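If you want to see the pre-faulting at work, here is a minimal, self-contained sketch (my own, Linux-only; mlockall typically needs root or CAP_IPC_LOCK) that counts minor page faults with getrusage(2). Once the 16 KB is touched and locked, touching it again causes no new faults.

#include <cstdio>
#include <cstring>
#include <sys/mman.h>
#include <sys/resource.h>

static const size_t kStackSize = 16 * 1024; // 16 KB, as in prefaultStack()

static long minorFaults()
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

static void prefault()
{
    volatile char stack[kStackSize];                 // forces the stack to grow now
    memset(const_cast<char*>(stack), 0, kStackSize); // touch every page
}

int main()
{
    prefault();
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1) {
        perror("mlockall"); // typically needs root or CAP_IPC_LOCK
        return 1;
    }
    long before = minorFaults();
    prefault(); // reuses pages that are already faulted in and locked
    printf("minor faults during second prefault: %ld\n", minorFaults() - before);
    return 0;
}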
Related
As in the title, can someone help me make more sense of the heap and stack in CUDA? Are they any different from the usual heap and stack in CPU memory?
I ran into a problem when increasing the stack size in CUDA; it seems to have a limit, because when I set the stack size above 1024*300 (Tesla M2090) via cudaDeviceSetLimit, I got an error: argument invalid.
Another problem I want to ask about: when I set the heap size to a very large number (about 2 GB) to allocate my RTree (a data structure) with 2000 elements, I got a runtime error: too many resources requested to launch
Any idea?
P.S.: I launch with only a single thread (kernel<<<1,1>>>)
About stack and heap
The stack is allocated per thread and has a hardware limit (see below).
The heap resides in global memory; it can be allocated using malloc() and must be explicitly freed using free() (CUDA doc).
These device functions:
void* malloc(size_t size);
void free(void* ptr);
can be useful, but I would recommend using them only when they are really needed. A better approach is usually to rethink the code and allocate the memory using host-side functions (such as cudaMalloc).
The stack size has a hardware limit, which can be computed (according to this answer by @njuffa) as the minimum of:
amount of local memory per thread
available GPU memory / number of SMs / maximum resident threads per SM
Since you are increasing the size while running only one thread, I guess you are hitting the second limit, which in your case (Tesla M2090) should be: 6144 MB / 16 / 512 = 0.75 MB (about 750 KB).
The heap has a fixed size (default 8MB) that must be specified before any call to malloc() by using the function cudaDeviceSetLimit. Be aware that the memory allocated will be at least the size requested due to some allocation overhead.
Also worth mentioning: the heap memory is not per-thread; allocations have the lifetime of the CUDA context (until released by a call to free()) and can be used by threads in subsequent kernel launches.
Related posts on stack: ... stack frame for kernels, ... local memory per cuda thread
Related posts on heap: ... heap memory ..., ... heap memory limitations per thread
Stack and heap are different things. The stack is the per-thread stack; the heap is the per-context runtime heap that device-side malloc/new allocates from. You set the stack size with the cudaLimitStackSize flag and the runtime heap with the cudaLimitMallocHeapSize flag, both passed to the cudaDeviceSetLimit API.
It sounds like you want to increase the heap size but are trying to do so by changing the stack size. On the other hand, if you need a large stack size, you may have to reduce the number of threads per block in order to avoid kernel launch failures.
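For completeness, here is a minimal host-side sketch (mine, not from the question; the sizes are purely illustrative) showing how both limits are set with cudaDeviceSetLimit and read back with cudaDeviceGetLimit:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Per-thread stack: subject to the hardware limit discussed above.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    if (err != cudaSuccess) printf("stack: %s\n", cudaGetErrorString(err));

    // Device-side malloc()/new heap: must be set before the first kernel
    // that allocates from it is launched.
    err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256u * 1024 * 1024);
    if (err != cudaSuccess) printf("heap: %s\n", cudaGetErrorString(err));

    size_t value = 0;
    cudaDeviceGetLimit(&value, cudaLimitStackSize);
    printf("stack size per thread: %zu bytes\n", value);
    cudaDeviceGetLimit(&value, cudaLimitMallocHeapSize);
    printf("malloc heap size: %zu bytes\n", value);
    return 0;
}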
I have the following loop, which pops from a C++ concurrent queue of mine, based on the implementation here: https://juanchopanzacpp.wordpress.com/2013/02/26/concurrent-queue-c11/
while (!interrupted)
{
    pxData data = queue->pop();
    if (data.value == -1)
    {
        break; // exit loop on terminating condition
    }
    usleep(7000); // stub to simulate processing
}
I am looking at the memory history using the System Monitor in CentOS 7.
I'm trying to free up the memory taken up by the queue after reading each value from it. However, as the while loop above runs, I don't see the memory usage going down. I've verified that the queue length does go down.
The memory does go down, however, when -1 is encountered and the loop exits (while the program is still running). But I can't have that, because where the usleep is, I want to do some intensive processing.
Question: Why doesn't the memory occupied by data get freed (according to System Monitor)? Isn't stack-allocated memory supposed to be freed when the variable goes out of scope?
The struct is defined as follows, and populated at the beginning of the program.
typedef struct pxData
{
    float value; // -1 value terminates the loop
    float x, y, z;
    std::complex<float> valueData[65536];
} pxData;
It's populated with ~10,000 pxData elements, which roughly translates to 5 GB. The system only has ~8 GB.
So it's important that the memory is freed up for other processing in the system.
There are a few things at play here.
Virtual Memory
First, you need to understand that just because your program is "using" 5 GB of memory does not mean that there are only 3 GB of RAM left for other programs. Virtual memory means that those 5 GB might be only 1 GB of actual "resident" data, and the other 4 GB may actually be on disk rather than in RAM. So it's important to look at the "resident set size" rather than the "virtual size" when you're looking at your program. And note that if your system actually runs low on RAM, the OS may shrink the RSS of some programs by "paging out" some of their memory. So don't worry too much about "5 GB" appearing in the system monitor--worry if you have a real, concrete performance problem.
Heap Allocation
The second aspect is why your virtual size does not decrease as you remove items from the queue. We can guess that you put those elements into the queue by creating them with malloc or new one-by-one, then pushing them onto the back of the queue. This means that the first element you allocated will come out of the queue first. And that in turn means that when you have drained 90% of the queue, your memory allocation might look like this:
[program|------------------unused-------------------|pxData]
The problem here is that in the real world, just because you free or delete something does not mean the operating system instantly reclaims that memory. In fact, it may not be able to reclaim any unused spans unless they are at the "end" (i.e. most recently allocated). Since C++ does not have garbage collection and cannot move items around in memory without your consent, you end up with this big "hole" in your program's virtual memory. That hole would be used to satisfy future memory allocation requests, but if you haven't got any, it just sits there, until the queue is completely empty:
[program|------------------unused--------------------------]
Then the system is able to shrink your virtual address space back down:
[program]
Which brings you back to where you started.
Solutions
If you want to "fix" this, one option is to allocate your memory in "reverse", i.e. put the last items allocated into the front of the queue.
Another option is to allocate the elements for the queue via mmap, which is something that e.g. Linux will do automatically for allocations which are "large." You can change the threshold for this by calling mallopt(3) with M_MMAP_THRESHOLD and setting it to be a little bit smaller than your struct size. This makes the allocations independent of each other, so the OS can reclaim them individually. This technique can even be applied to existing programs without recompilation, so is often useful if you need to solve this problem in a program you cannot modify.
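For illustration, a minimal sketch of the mallopt(3) route (glibc-specific; the threshold value is my assumption, based on sizeof(pxData) being about 512 KB):

#include <malloc.h> // mallopt, M_MMAP_THRESHOLD (glibc)

int main()
{
    // sizeof(pxData) is ~512 KB (65536 * sizeof(std::complex<float>) = 512 KiB
    // plus a few floats), so any threshold below that routes these allocations
    // through mmap(2), and each delete/free then returns its pages to the OS
    // immediately via munmap(2).
    mallopt(M_MMAP_THRESHOLD, 256 * 1024);
    // ... allocate, queue, pop, and delete pxData elements as before ...
    return 0;
}

If you cannot recompile the program, glibc also honors the MALLOC_MMAP_THRESHOLD_ environment variable to the same effect (see mallopt(3)).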
A C++ implementation would call some operator delete to release dynamically allocated (using some operator new) memory. In several C++ standard libraries, new calls malloc and delete calls free.
(I am writing from a Linux point of view, but the principles are similar on other OSes.)
But while malloc (or ::operator new) sometimes asks the OS kernel for more memory via system calls that change the virtual address space, like mmap(2), free (or ::operator delete) often simply marks the released memory zone as re-available to future calls to malloc (or to new).
So from the kernel's point of view (e.g. as seen through /proc/, see proc(5)...), the virtual address space is not changing and the memory remains consumed, even though inside the application it is marked as "freed" and will be reused by some future allocation (future calls to malloc or new).
And most C++ standard containers use heap data internally. In particular, a local (stack-allocated) std::map or std::vector (or std::deque) variable will call new & delete for its internal data.
BTW, I find your declaration quite strange. Unless every struct pxData has exactly 65536 used valueData slots, I would suggest using a std::vector, so you'd have
std::vector<std::complex<float>> valueData;
and improve your code accordingly. You'll probably need to do some valueData.reserve(somesize); and/or valueData.resize(somesize); and/or valueData.push_back(somecomplexnumber); etc.
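For concreteness, a minimal sketch of the reworked declaration (my rewrite of the struct from the question; the 65536 figure is just the old fixed capacity):

#include <complex>
#include <vector>

typedef struct pxData
{
    float value; // -1 value terminates the loop
    float x, y, z;
    std::vector<std::complex<float>> valueData; // heap-backed, sized per element
} pxData;

// usage sketch:
// pxData d;
// d.valueData.reserve(65536);          // only if the maximum is known up front
// d.valueData.push_back({1.0f, 0.0f}); // otherwise grow only as far as needed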
I understand that if you have a multithreaded application and you need to allocate a lot of memory, then you should allocate it on the heap. Stack space is divided up amongst the threads of your application, so the stack for each thread gets smaller as you create new threads; thus, if you tried to allocate lots of memory on the stack, it could overflow. But assuming you have a single-threaded application, is the stack size essentially the same as that of the heap?
I read elsewhere that stack and heap don't have a clearly defined boundary in the address space, rather that they grow into each other.
P.S. The lifetime of the objects being allocated is not an issue. The objects get created first thing in the program and get cleaned up at exit. I don't have to worry about them going out of scope, and thus getting cleaned off the stack.
No, stack size is not the same as heap size. Stack objects get pushed/popped in LIFO order and are used for things such as program flow. For example, arguments are "pushed" onto the stack before a function call, then "popped" off as function arguments to be accessed. Deep recursion therefore uses a lot of stack space. The heap is really for pointers and allocated memory. In the real world, the stack is like the gears in your clock, and the heap is like your desk: your clock sits on your desk because it takes up room, but you use it for something completely different than your desk.
Check out this question on Stack Overflow:
Why is memory split up into stack and heap?
I am a student in a systems software faculty. Now I'm developing a memory manager for Windows. Here's my simple implementation of malloc() and free():
#include <windows.h>

HANDLE heap = HeapCreate(0, 0, 0);

void* hmalloc(size_t size)
{
    return HeapAlloc(heap, 0, size);
}

void hfree(void* memory)
{
    HeapFree(heap, 0, memory);
}

int main()
{
    int* ptr1 = (int*)hmalloc(100 * sizeof(int));
    int* ptr2 = (int*)hmalloc(100 * sizeof(int));
    int* ptr3 = (int*)hmalloc(100 * sizeof(int));

    hfree(ptr2);
    hfree(ptr3);
    hfree(ptr1);

    return 0;
}
It works fine. But I can't understand: is there a reason to use multiple heaps? Well, I can allocate memory on the heap and get the address of an allocated memory chunk. But here I use ONE heap. Is there a reason to use multiple heaps? Maybe for multi-threaded/multi-process applications? Please explain.
The main reason for using multiple heaps/custom allocators is better memory control. Usually, after lots of news and deletes, memory gets fragmented and the application loses performance (and it consumes more memory as well). Using the memory in a more controlled environment can reduce heap fragmentation.
Another use is preventing memory leaks in the application: you can just free the entire heap you allocated, without having to bother freeing every object allocated there.
Another use is for tightly allocated objects. If you have, for example, a list, you can allocate all the nodes in a smaller, dedicated heap, and the app gains performance because there are fewer cache misses when iterating over the nodes.
Edit: memory management is a hard topic, however, and in some cases it is not done right. Andrei Alexandrescu once said in a talk that for some applications, replacing the custom allocator with the default one increased performance.
This is a good link that elaborates on why you may need multiple heaps:
https://caligari.dartmouth.edu/doc/ibmcxx/en_US/doc/libref/concepts/cumemmng.htm
"Why Use Multiple Heaps?
Using a single runtime heap is fine for most programs. However, using multiple
heaps can be more efficient and can help you improve your program's performance
and reduce wasted memory for a number of reasons:
1- When you allocate from a single heap, you may end up with memory blocks on
different pages of memory. For example, you might have a linked list that
allocates memory each time you add a node to the list. If you allocate memory for
other data in between adding nodes, the memory blocks for the nodes could end up
on many different pages. To access the data in the list, the system may have to
swap many pages, which can significantly slow your program.
With multiple heaps, you can specify which heap you allocate from. For example,
you might create a heap specifically for the linked list. The list's memory blocks
and the data they contain would remain close together on fewer pages, reducing the
amount of swapping required.
2- In multithread applications, only one thread can access the heap at a time to
ensure memory is safely allocated and freed. For example, say thread 1 is
allocating memory, and thread 2 has a call to free. Thread 2 must wait until
thread 1 has finished its allocation before it can access the heap. Again, this
can slow down performance, especially if your program does a lot of memory
operations.
If you create a separate heap for each thread, you can allocate from them
concurrently, eliminating both the waiting period and the overhead required to
serialize access to the heap.
3- With a single heap, you must explicitly free each block that you allocate. If you
have a linked list that allocates memory for each node, you have to traverse the
entire list and free each block individually, which can take some time.
If you create a separate heap for that linked list, you can destroy it with a
single call and free all the memory at once.
4- When you have only one heap, all components share it (including the IBM C and
C++ Compilers runtime library, vendor libraries, and your own code). If one
component corrupts the heap, another component might fail. You may have trouble
discovering the cause of the problem and where the heap was damaged.
With multiple heaps, you can create a separate heap for each component, so if
one damages the heap (for example, by using a freed pointer), the others can
continue unaffected. You also know where to look to correct the problem."
A reason would be a scenario where you need to execute code internally, e.g. running simulation code. By creating your own heap, you could allow that heap to have execution rights, which are turned off by default for security reasons (on Windows).
You have some good thoughts, and this would work for C, but in C++ you have destructors, and it is VERY important that they run.
You can think of all types as having constructors/destructors; it's just that some logically "do nothing".
This is about allocators. See "The buddy algorithm" which uses powers of two to align and re-use stuff.
If I allocate 4 bytes somewhere, my allocator might allocate a 4 KB section just for 4-byte allocations. That way I can fit 1024 4-byte things in the block; if I need more, I add another block, and so forth.
Ask it for 4 KB and it won't allocate that in the 4-byte block; it might keep a separate one for larger requests.
This means you can keep big things together. If I allocate 17 bytes, then 13 bytes, then 1 byte, and the 13-byte allocation gets freed, I can only fit something of <= 13 bytes into that hole.
Hence the buddy system and powers of 2, which are easy to handle using left shifts: if I want a 2.5 KB block, I allocate the smallest power of 2 that will fit (4 KB in this case); that way I can reuse the slot afterwards for <= 4 KB items.
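For illustration, here is a minimal sketch of that power-of-two rounding (my own; no particular allocator's code):

#include <cstdio>
#include <cstddef>

// Round a request up to the smallest power of two that fits, shifts only.
static size_t roundUpPow2(size_t n)
{
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

int main()
{
    printf("%zu\n", roundUpPow2(2560)); // 2.5 KB request -> 4096-byte slot
    printf("%zu\n", roundUpPow2(17));   // 17 bytes       -> 32-byte slot
    return 0;
}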
This is not for garbage collection; it is just about keeping things compact and neat. Using your own allocator can avoid calls to the OS (depending on the default implementation of new and delete, which might already do this for your compiler) and make new/delete very quick.
Heap compacting is very different: you need a list of every pointer that points into your heap, or some way to traverse the entire memory graph (as Java has), so that when you move things around and "compact" the heap you can update everything that pointed to the moved objects.
The only time I ever used more than one heap was when I wrote a program that would build a complicated data structure. It would have been non-trivial to free the data structure by walking through it and freeing the individual nodes, but luckily for me the program only needed the data structure temporarily (while it performed a particular operation), so I used a separate heap for the data structure so that when I no longer needed it, I could free it with one call to HeapDestroy.
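Here is a minimal sketch of that pattern (my reconstruction, with a hypothetical Node type rather than the original program's data structure):

#include <windows.h>

struct Node
{
    Node* next;
    int payload;
};

int main()
{
    HANDLE scratch = HeapCreate(0, 0, 0); // growable private heap
    Node* head = NULL;
    for (int i = 0; i < 1000; ++i)
    {
        Node* n = (Node*)HeapAlloc(scratch, 0, sizeof(Node));
        if (n == NULL) break;             // allocation failure: stop building
        n->payload = i;
        n->next = head;
        head = n;
    }
    // ... use the structure for the temporary operation ...
    HeapDestroy(scratch);                 // frees every node with one call
    return 0;
}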
Whilst asking another question (and also before), I was wondering: how do I judge whether to create an object on the heap or keep it as an object on the stack? What should I ask myself about the object to choose the correct allocation?
Put it on the heap if you have to, the stack if you can.
What kinds of things do you need to put on the heap? Anything of varying length. Any object that might need to be null. Anything that's very large, lest you cause a stack overflow.
Simple answer.
When it goes out of scope, do you want it to hang around and be able to use it?
It depends on the intended lifetime of the object.
If you want the object to be alive even after the function returns, then HEAP, else STACK.
If an object is placed on the HEAP, it must be explicitly free()'d or deleted by the programmer once its use is over; otherwise the program will leak memory.
Stack memory is fast. It is fast because (a) there is no system overhead to allocate the memory - the allocation is done by simply moving the stack pointer in one instruction and (b) the memory in the stack is "hot" so it is already in cache. Heap memory is slow because (a) it requires a lot of system work to look around and find a free chunk of memory and (b) is probably not in cache and will require evicting some data you might have wanted.
Stack memory doesn't get fragmented. It is possible that a heap eventually gets so fragmented, you can't allocate anything (even though ironically there is still enough unused memory!)
For long lived data and for large data (multi KB or more), you have to use a heap.
The danger of allocating a bigger stack is that it might hurt you if you are running multiple threads. You have to size the stack for the "worst case" usage, and each thread requires its own stack. On a high-core-count machine (where you might have 200+ threads running), you may not want to arbitrarily increase the stack size. The heap, on the other hand, does not need to be sized for "worst case" usage; it is much more efficient.
Two reasons to use the heap:
1- You want the data after the current scope.
2- You want to reserve a large amount of memory.
Other than that, stay on the stack.
Note: don't reserve a lot of memory on the stack, or you'll get a "Stack-overflow" ;)
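To put the two rules above into code, a minimal sketch (the names are illustrative):

#include <memory>
#include <vector>

// Rule 1: data needed after the current scope goes on the heap.
std::unique_ptr<int> makeCounter()
{
    int local = 0;                          // stack: gone after the return
    return std::make_unique<int>(local);    // heap: survives the return
}

int main()
{
    char small[256];                        // small and scope-bound: stack is fine
    (void)small;
    std::vector<char> big(8 * 1024 * 1024); // Rule 2: 8 MB would risk a stack
                                            // overflow, so the vector keeps it
                                            // on the heap
    auto counter = makeCounter();
    *counter += 1;
    return 0;
}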