Stack size in QThread - c++

What is the default maximum stack size when using QThread in Qt 5 and C++? I have a QVector in my thread and I am calling myvector.append(), and I'm interested in how big my vector can be. I found the uint QThread::stackSize() const method, which returns the stack size, but only if it was previously changed by setStackSize(). What is the default stack size?

QThread stack size can be read and set by these two calls:
uint QThread::stackSize() const
void QThread::setStackSize(uint stackSize)
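For example, a minimal sketch (my own example, not from the Qt docs; note that setStackSize() must be called before start()):

class Worker : public QThread
{
    void run() override { /* thread work */ }
};

Worker worker;
worker.setStackSize(16 * 1024 * 1024); // request a 16 MiB stack; must precede start()
worker.start();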
QThread maps onto an OS-specific thread on each platform; on Linux, for example, the default pthread stack size is typically 8 MB.
But it sounds like you are concerned that your QVector is growing on the stack; that is not happening. QVector stores its data on the heap.
From the source code of QVector:
void QVector<T>::append(const T &t)
{
    ...
    if (!isDetached() || isTooSmall) {
        ...
        reallocData(d->size, isTooSmall ? d->size + 1 : d->alloc, opt);
        ...
    }
}
All it does is allocate new space on the heap (pre-allocating a predefined number of elements) and make sure the data is stored in one contiguous memory block. It does not appear to concern itself with page boundaries and the like.
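A quick way to convince yourself of this (my own sketch, not from the original answer): the QVector object itself stays tiny no matter how many elements it holds, because the elements live in a heap block reached through an internal d-pointer.

QVector<int> v;
for (int i = 0; i < 1000000; ++i)
    v.append(i);   // elements are (re)allocated on the heap as the vector grows
// sizeof(v) stays constant regardless of v.size()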

The stack size only plays a role if you are compiling a 32-bit application and you're allocating the storage for your buffer explicitly on the stack. Just using an automatic QVector or std::vector instance doesn't allocate any large buffers on the stack - you'd need to use a custom allocator for that.
IIRC, in 64-bit applications the memory is laid out such that you won't ever run out of stack space for any reasonable number of threads.

Related

Is there some implicit conflict in initializing objects using 'new' in __global__ functions? [duplicate]

As in the title, can someone explain heap and stack in CUDA to me in more detail? Are they any different from the usual heap and stack in CPU memory?
I hit a problem when I increased the stack size in CUDA; it seems to have a limit, because when I set the stack size above 1024*300 (on a Tesla M2090) via cudaDeviceSetLimit, I got an error: argument invalid.
Another problem I want to ask about: when I set the heap size to a very large number (about 2 GB) to allocate my RTree (a data structure) with 2000 elements, I got a runtime error: too many resources requested to launch.
Any idea?
P.S.: I launch with only a single thread (kernel<<<1,1>>>).
About stack and heap
The stack is allocated per thread and has a hardware limit (see below).
The heap resides in global memory, can be allocated using malloc(), and must be explicitly freed using free() (see the CUDA docs).
These device functions:
void* malloc(size_t size);
void free(void* ptr);
can be useful, but I would recommend using them only when they are really needed. It is usually a better approach to rethink the code so that memory is allocated with the host-side functions (such as cudaMalloc).
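For instance, a minimal host-side sketch of that approach (the buffer name and element count are illustrative):

int *d_buf = NULL;
cudaError_t err = cudaMalloc((void**)&d_buf, 2000 * sizeof(int)); // allocated from the host
if (err != cudaSuccess) { /* handle the error */ }
// ... launch kernels that use d_buf ...
cudaFree(d_buf);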
The stack size has a hardware limit which can be computed (according to this answer by @njuffa) as the minimum of:
amount of local memory per thread
available GPU memory / number of SMs / maximum resident threads per SM
As you are increasing the size, and you are running only one thread, I guess your problem is the second limit, which in your case (Tesla M2090) should be: 6144 MB / 16 / 512 = 0.75 MB (768 KB).
The heap has a fixed size (default 8 MB) that must be specified before any call to malloc() by using the function cudaDeviceSetLimit. Be aware that the memory actually reserved will be at least the size requested, due to some allocation overhead.
Also, it is worth mentioning that this memory is not per-thread: it has the lifetime of the CUDA context (until released by a call to free()) and can be used by threads in subsequent kernel launches.
Related posts on stack: ... stack frame for kernels, ... local memory per cuda thread
Related posts on heap: ... heap memory ..., ... heap memory limitations per thread
Stack and heap are different things. Stack represents the per thread stack, heap represents the per context runtime heap that device malloc/new uses to allocate memory. You set stack size with the cudaLimitStackSize flag, and runtime heap with the cudaLimitMallocHeapSize flag, both passed to the cudaDeviceSetLimit API.
It sounds like you want to increase the heap size, but are trying to do so by changing the stack size. On the other hand, if you do need a large stack size, you may have to reduce the number of threads per block in order to avoid kernel launch failures.
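A short host-side sketch of setting and reading back both limits (the byte counts are arbitrary examples; set the heap limit before the first kernel that calls device-side malloc):

size_t granted = 0;
cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);              // 64 KB stack per thread
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256 * 1024 * 1024); // 256 MB device-malloc heap
cudaDeviceGetLimit(&granted, cudaLimitMallocHeapSize);          // confirm what was granted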

Memory Demands: Heap vs Stack in C++

So I had a strange experience this evening.
I was working on a program in C++ that required some way of reading a long list of simple data objects from a file and storing them in main memory, approximately 400,000 entries. The object itself is something like:
class Entry
{
public:
    Entry(int x, int y, int type);
    Entry();
    ~Entry();
    // some other basic functions
private:
    int m_X, m_Y;
    int m_Type;
};
Simple, right? Well, since I needed to read them from a file, I had a loop like
Entry** globalEntries;
globalEntries = new Entry[totalEntries]; // totalEntries read from file, about 400,000
for (int i = 0; i < totalEntries; i++)
{
    globalEntries[i] = new Entry(.......);
}
That addition to the program added about 25 to 35 megabytes to its memory usage when I tracked it in the task manager. A simple change to stack allocation:
Entry* globalEntries;
globalEntries = new Entry[totalEntries];
for (int i = 0; i < totalEntries; i++)
{
    globalEntries[i] = Entry(.......);
}
and suddenly it only required 3 megabytes. Why is that happening? I know pointers carry a little extra overhead (4 bytes for the pointer address), but that shouldn't be enough to make THAT much of a difference. Could it be that the program is allocating memory inefficiently and ending up with chunks of unallocated memory between the allocated blocks?
Your code is wrong, or I don't see how this worked. With new Entry[count] you create a new array of Entry (the type is Entry*), yet you assign it to an Entry**, so I presume you actually used new Entry*[count].
What you did next was create another new Entry object on the heap and store it in the globalEntries array. So you need memory for 400,000 pointers plus 400,000 elements. 400,000 pointers take about 3 MiB of memory on a 64-bit machine. Additionally, you have 400,000 individual Entry allocations, which all require sizeof(Entry) plus potentially some more memory (for the memory manager -- it might have to store the size of the allocation, the associated pool, alignment/padding, etc.). This additional bookkeeping memory can quickly add up.
If you change your second example to:
Entry* globalEntries;
globalEntries = new Entry[count];
for (...) {
    globalEntries[i] = Entry(...);
}
memory usage should be equal to the stack approach.
Of course, ideally you'll use a std::vector<Entry>.
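For instance (a sketch using the names from the question; x, y and type stand for whatever your file parsing produces):

std::vector<Entry> entries;
entries.reserve(totalEntries);            // one contiguous allocation up front
for (int i = 0; i < totalEntries; i++)
    entries.push_back(Entry(x, y, type)); // no per-element heap allocation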
First of all, without specifying which column exactly you were watching, the number in the task manager means nothing. On a modern operating system it's difficult even to define what you mean by "used memory": are we talking about private pages? The working set? Only the stuff that stays in RAM? Does reserved but not committed memory count? Who pays for memory shared between processes? Are memory-mapped files included?
If you are watching some meaningful metric, it's impossible to see 3 MB of memory used - your object is at least 12 bytes (assuming 32-bit integers and no padding), so 400,000 elements will need about 4.58 MB. Also, I'd be surprised if it worked with actual stack allocation - the default stack size in VC++ is 1 MB, so you should already have had a stack overflow.
Anyhow, it is reasonable to expect a different memory usage:
the stack is (mostly) allocated right from the beginning, so that's memory you nominally consume even without really using it for anything (actually, virtual memory and automatic stack expansion make this a bit more complicated, but it's "true enough");
the CRT heap is opaque to the task manager: all it sees is the memory given by the operating system to the process, not what the C heap "really" has in use; the heap grows (requesting memory from the OS) more than strictly necessary, to be ready for further memory requests - so what you see is how much memory it is ready to give away without further syscalls;
your "separate allocations" method has a significant overhead. The all-contiguous array you'd get with new Entry[size] costs size*sizeof(Entry) bytes, plus the heap bookkeeping data (typically a few integer-sized fields); the separate-allocations method costs at least size*sizeof(Entry) (the size of all the "bare" elements) plus size*sizeof(Entry *) (the pointer array) plus size+1 times the cost of each allocation. If we assume a 32-bit architecture with a cost of two ints per allocation, you quickly see that this costs size*24+8 bytes of memory, instead of size*12+8 for the contiguous array in the heap;
the heap normally hands out blocks that aren't exactly the size you asked for, because it manages blocks of fixed sizes; so, if you allocate single objects like that, you are probably also paying for some extra padding - supposing it uses 16-byte blocks, you are paying 4 bytes extra per element by allocating them separately; this moves our memory estimate to size*28+8, i.e. an overhead of 16 bytes per each 12-byte element. A quick check of these estimates follows below.
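A back-of-the-envelope check of those numbers (a throwaway sketch under the answer's 32-bit assumptions):

#include <cstdio>

int main()
{
    const unsigned long n = 400000;
    const unsigned long contiguous = n * 12 + 8;           // one array: ~4.58 MB
    const unsigned long separate   = n * (12 + 4 + 8) + 8; // elements + pointers + per-allocation headers
    const unsigned long padded     = n * (16 + 4 + 8) + 8; // element blocks rounded up to 16 bytes
    std::printf("%lu %lu %lu\n", contiguous, separate, padded);
    return 0;
}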

Increase stack size to use alloca()?

This is two questions overlapping. I wish to try out alloca() for large arrays instead of allocating dynamically-sized arrays on the heap, so that I can increase performance by avoiding heap allocations. However, I get the impression that stack sizes are usually quite small? Are there any disadvantages to increasing the size of my stack so that I can take full advantage of alloca()? Is it the case that the more RAM I have, the larger I can proportionally make my stack?
EDIT1: Preferably Linux
EDIT2: I don't have a specific size in mind; I would rather know how to judge what determines the limits/boundaries.
Stack sizes are (by default) 8 MB on most Unix-y platforms, and 1 MB on Windows (mainly because Windows has a deterministic way of recovering from out-of-stack problems, while Unix-y platforms usually just raise a generic SIGSEGV).
If your allocations are large, you won't see much of a performance difference between allocating on the heap versus allocating on the stack. Sure, the stack is slightly more efficient per allocation, but if your allocations are large the number of allocations is likely to be small.
If you want a larger stack-like structure, you can always write your own allocator which obtains a large block from the heap and then handles allocation/deallocation in a stack-like fashion:
#include <cstddef>
#include <new>

class StackLikeAllocator
{
    std::size_t usedSize;
    std::size_t maximumSize;
    char *memory;
public:
    StackLikeAllocator(std::size_t backingSize)
    {
        memory = new char[backingSize];
        usedSize = 0;
        maximumSize = backingSize;
    }
    ~StackLikeAllocator()
    {
        delete[] memory; // safe: memory is a char*, not a void*
    }
    void * Allocate(std::size_t desiredSize)
    {
        // You would have to make sure alignment was correct for your
        // platform (exercise for the reader)
        std::size_t newUsedSize = usedSize + desiredSize;
        if (newUsedSize > maximumSize)
        {
            // std::bad_alloc has no message constructor
            throw std::bad_alloc();
        }
        void* result = memory + usedSize;
        usedSize = newUsedSize;
        return result;
    }
    // If you need to support deallocation then modifying this shouldn't be
    // too difficult
};
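Usage would look something like this (sizes are arbitrary):

StackLikeAllocator arena(1 << 20);          // 1 MiB backing block from the heap
void *scratch = arena.Allocate(64 * 1024);  // stack-like: just a pointer bump
// ... use scratch; everything is released when arena goes out of scope ...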
The default stack size that the main thread of a program gets is a compiler-specific (and/or OS-specific) thing and you should see the appropriate documentation to find out how to enlarge the stack.
You may find that you cannot enlarge the program's default stack to an arbitrarily large size.
You may, however, as it's been pointed out, be able to create a thread at run time with a stack of the size you want.
In any event, there's not much benefit to alloca() over a once-allocated large buffer; you don't need to free and reallocate it many times.
The most important difference between alloca() and new / malloc() is that all memory allocated with alloca() will be gone when you return from the current function.
alloca() is only useful for small temporary data structures.
It is only useful for small data structures, because big data structures will destroy the cache locality of your stack, which will cost you a rather big performance hit. The same goes for large arrays as local variables.
Use alloca() only in very specific circumstances. If unsure, do not use it at all.
The general rule is: Do not put big data structures (>= 1k) on the stack. The stack does not scale. It is a very limited resource.
To answer the first question: The stack size is typically small relative to the heap size (this would hold true in most Linux applications).
If the allocations you are planning are large relative to the actual default stack size, then I think it would be better to use dynamic allocation from the heap (rather than trying to increase stack sizes). The cost of using the memory (filling it, reading it, manipulating it) is probably going to far exceed the cost of the allocation. It is unlikely that you would see a measurable benefit by allocating from the stack in this scenario.

C++ How to allocate memory dynamically on stack?

Is there a way to allocate memory on the stack instead of the heap? I can't find a good book on this; does anyone here have an idea?
Use alloca() (sometimes called _alloca() or _malloca()), but be very careful about it: it frees its memory when you leave the function, not when you go out of scope, so you'll quickly blow up if you use it inside a loop.
For example, if you have a function like
int foo( int nDataSize, int iterations )
{
    for ( int i = 0; i < iterations; ++i )
    {
        // alloca() returns void*, so C++ needs the cast
        char *bytes = static_cast<char *>( alloca( nDataSize ) );
        // the memory above IS NOT FREED when we pass the brace below!
    }
    return 0;
} // alloca() memory only gets freed here
Then alloca() will allocate an additional nDataSize bytes every time through the loop. None of the alloca() bytes get freed until you return from the function. So, if you have an nDataSize of 1024 and an iterations of 8, you'll allocate 8 kilobytes before returning. If you have nDataSize = 65536 and iterations = 32768, you'll allocate a total of 65536×32768 = 2,147,483,648 bytes (2 GiB), almost certainly blowing your stack and causing a crash.
Anecdote: you can easily get into trouble if you write past the end of the buffer, especially if you pass the buffer into another function and that subfunction has the wrong idea about the buffer's length. I once fixed a rather amusing bug where we were using alloca() to create temporary storage for rendering a TrueType font glyph before sending it over to GPU memory. Our font library didn't account for the diacritic in the Swedish Å character when calculating glyph sizes, so it told us to allocate n bytes to store the glyph before rendering, and then actually rendered n+128 bytes. The extra 128 bytes wrote into the call stack, overwriting the return address and inducing a really painful nondeterministic crash!
Since this is tagged C++, typically you just declare the objects you need in the correct scope. They are allocated on the stack, and guaranteed to be released on scope exit. This is RAII, and a critical advantage of C++ over C. No mallocs or news, and especially no allocas, required.
You can declare a local char[1024] or whatever number of bytes you'd like (up to a point), then take the address of the local to get a pointer to this block of memory on the stack. Not exactly dynamic, but you could then wrap this memory with your own memory manager if desired.
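A minimal sketch of that idea (my own example):

void work()
{
    char buffer[1024]; // lives on this function's stack frame
    char *p = buffer;  // pointer to the stack block; hand it to your own manager
    // ... carve pieces out of p yourself ...
}                      // released automatically at scope exit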
See _malloca.
From an article discussing dynamic memory allocation:
We can allocate variable-length space dynamically on stack memory by using the function _alloca. This function allocates memory from the program stack. It simply takes the number of bytes to be allocated and returns a void* to the allocated space, just as a malloc call would. This allocated memory is freed automatically on function exit, so it need not be freed explicitly.
One has to keep the allocation size in mind here, as a stack overflow exception may occur. Stack overflow exception handling can be used for such calls; in case of a stack overflow exception, one can use _resetstkoflw() to recover from it.
So our new code with _alloca would be:
int NewFunctionA()
{
    char* pszLineBuffer = (char*) _alloca(1024*sizeof(char));
    …..
    // Program logic
    ….
    // no need to free pszLineBuffer
    return 1;
}
When/if C++ allows the use of (non-static) const values for array bounds, it will be easier.
For now, the best way I know of is via recursion. There are all kinds of clever tricks that can be done, but the easiest I know of is to have your routine declare a fixed-sized array, and fill and operate on what it has. When it's done, if it needs more space to finish, it calls itself. A sketch of this follows below.
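A minimal sketch of the recursion trick (the details are my own, not the answerer's code): each recursive call contributes another fixed-size chunk of stack storage.

#include <istream>
#include <cstddef>

std::size_t consumeStream(std::istream& in)
{
    char buffer[4096];                   // fixed-size stack storage for this frame
    in.read(buffer, sizeof buffer);
    std::size_t got = static_cast<std::size_t>(in.gcount());
    // ... fill and operate on buffer here ...
    if (got == sizeof buffer)            // more data than this frame can hold?
        return got + consumeStream(in);  // recurse: a fresh buffer on a new frame
    return got;
}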
You could use the BDE C++ library, e.g.
const int BUFFER_SIZE = 1024;
char buffer[BUFFER_SIZE];
bdlma::BufferedSequentialAllocator allocator(buffer, BUFFER_SIZE);
bsl::vector<int> dataVector(&allocator);
dataVector.resize(50);
BDE supplies comprehensive allocator options along with collections like bsl::vector that can use polymorphic allocators without changing the type of the container.
You might also consider:
https://github.com/facebook/folly/blob/master/folly/docs/small_vector.md
http://www.boost.org/doc/libs/1_55_0/doc/html/container/non_standard_containers.html#container.non_standard_containers.static_vector
http://llvm.org/docs/doxygen/html/classllvm_1_1SmallVector.html

Overriding global operator new to track huge memory allocations?

I am trying to produce a special build of a large monolithic application. The problem I am trying to solve is tracking hard-to-reproduce huge memory allocations (30-80 gigabytes, judging by what OS reports).
I believe the problem is a std::vector resized to a negative 32-bit integer value. The only platform exhibiting this behavior is Solaris (maybe it's the only platform that manages to successfully allocate such chunks of contiguous memory).
Can I globally replace std::vector with my own class, delegating all calls to the real vector and watching for suspicious allocations (size > 0x7FFFFFFFu)? Maybe selectively replace the constructor that takes a size_t, and the resize() methods? Maybe even hijack the global operator new?
Why not do something like this?

#include <cstdlib> // malloc
#include <cstddef>

// Note: a conforming replacement would throw std::bad_alloc on failure;
// for a special debugging build this is good enough.
void *operator new(size_t size)
{
    // if (size > MAX_SIZE) ...
    return malloc(size);
}

void *operator new[](size_t size)
{
    // if (size > MAX_SIZE) ...
    return malloc(size);
}
Setting a breakpoint in the if would find the problem right away.
You can provide a custom allocator for your vector at the time it's constructed.
You could just delegate to std::allocator, and firewall the requested memory size, in the first instance. A sketch follows below.
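A minimal sketch of that idea (the class name and threshold check are mine; written C++03-style to match the era of the question):

#include <cassert>
#include <memory>
#include <vector>

template <typename T>
class WatchdogAllocator : public std::allocator<T>
{
public:
    template <typename U> struct rebind { typedef WatchdogAllocator<U> other; };

    WatchdogAllocator() {}
    template <typename U> WatchdogAllocator(const WatchdogAllocator<U>&) {}

    T* allocate(std::size_t n, const void* hint = 0)
    {
        // Trap the suspicious sizes described in the question.
        assert(n * sizeof(T) <= 0x7FFFFFFFu && "suspicious vector allocation");
        return std::allocator<T>::allocate(n, hint);
    }
};

// Same behavior as std::vector<int>, but huge resizes assert first.
typedef std::vector<int, WatchdogAllocator<int> > WatchedIntVector;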
Take a look at the implementation of the std::vector class on the problem platform. Each implementation handles memory management differently (e.g. some double the currently allocated space when you add an object beyond the vector's current capacity). If your objects are sufficiently large and/or you have a large number of entries being added to the vector, it would be possible to attempt to allocate beyond the available (contiguous) memory of the machine. If that is the case, you'll want to look into a custom allocator for that vector.
If you're storing that many large items in a vector, you may want to look into another collection (e.g. std::list) or try storing pointers instead of the actual objects.
You can supply your own allocator type to std::vector to track allocations.
But I doubt that's the reason. First, looking at the sizes (30-80 GB), I conclude it's 64-bit code. How could a negative 32-bit integer value make it into the vector size, which is 64-bit? It would have been promoted to 64 bits first, preserving its value. Second, if this problem only occurs on Solaris, that can indicate a different problem. As far as I remember, Solaris is the only OS that commits memory on allocation; the other operating systems only mark the address space as allocated until those memory pages are actually used. So I would search for allocations that are made but never touched.