In Linux is it possible to start a process (e.g. with execve) and make it use a particular memory region as stack space?
Background:
I have a C++ program and a fast allocator that gives me "fast memory". I can use it for objects that make use of the heap and create them in fast memory. Fine. But I also have a lot of variables living on the stack. How can I make them use the fast memory as well?
Idea: Implement a "program wrapper" that allocates fast memory and then starts the actual main program, passing a pointer to the fast memory and the program uses it as stack. Is that possible?
[Update]
The pthread setup seems to work.
With pthreads, you could use a secondary thread for your program logic, and set its stack address using pthread_attr_setstack():
NAME
pthread_attr_setstack, pthread_attr_getstack - set/get stack
attributes in thread attributes object
SYNOPSIS
#include <pthread.h>
int pthread_attr_setstack(pthread_attr_t *attr,
void *stackaddr, size_t stacksize);
DESCRIPTION
The pthread_attr_setstack() function sets the stack address and
stack size attributes of the thread attributes object referred
to by attr to the values specified in stackaddr and stacksize,
respectively. These attributes specify the location and size
of the stack that should be used by a thread that is created
using the thread attributes object attr.
stackaddr should point to the lowest addressable byte of a buffer
of stacksize bytes that was allocated by the caller. The
pages of the allocated buffer should be both readable and
writable.
What I don't follow is how you're expecting to get any performance improvements out of doing something like this (I assume the purpose of your "fast" memory is better performance).
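For illustration, the pthread_attr_setstack() approach could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: mmap() merely stands in for the "fast memory" allocator from the question, and the names run_on_custom_stack and report_local_address are made up for this example. It also assumes that releasing the buffer after pthread_join() is acceptable on the target platform.

```cpp
#include <pthread.h>
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Thread body: reports the address of one of its own local variables.
static void* report_local_address(void* out) {
    int local = 0;
    *static_cast<std::uintptr_t*>(out) = reinterpret_cast<std::uintptr_t>(&local);
    return nullptr;
}

// Runs a thread on a caller-supplied stack and returns true if the
// thread's local variable lived inside that buffer.
bool run_on_custom_stack(std::size_t stack_size) {
    // Assumption: mmap() stands in for the "fast memory" allocator.
    void* stack = mmap(nullptr, stack_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED) return false;

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    bool inside = false;
    // stackaddr is the LOWEST addressable byte of the buffer.
    if (pthread_attr_setstack(&attr, stack, stack_size) == 0) {
        pthread_t tid;
        std::uintptr_t local_addr = 0;
        if (pthread_create(&tid, &attr, report_local_address, &local_addr) == 0) {
            pthread_join(tid, nullptr);
            const auto base = reinterpret_cast<std::uintptr_t>(stack);
            inside = local_addr >= base && local_addr < base + stack_size;
        }
    }
    pthread_attr_destroy(&attr);
    // Assumption: the thread has been joined, so the buffer is no longer in use.
    munmap(stack, stack_size);
    return inside;
}
```

Calling run_on_custom_stack(1 << 20) runs the thread on a 1 MiB buffer; the program logic would go where report_local_address is.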
What is the easiest way to increase the size of the default stacksize for pthreads? Is there any way to tune the heapsize as well, process level and individual thread level? If new operator fails because of underlying memory leakage, how do I set new handler to take care of bad allocations?
What is the easiest way to increase the size of the default stacksize for pthreads?
You can use pthread_attr_setstacksize to set the stacksize when creating new threads. The stack size must not be smaller than PTHREAD_STACK_MIN.
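A minimal sketch of that call sequence, assuming a POSIX threads implementation; the helper name start_thread_with_stack_size is hypothetical and the thread body does nothing:

```cpp
#include <pthread.h>
#include <cstddef>

// Trivial thread body used only to demonstrate thread creation.
static void* noop(void*) { return nullptr; }

// Creates and joins a thread with the requested stack size.
// Returns the stack size stored in the attributes object, or 0 on failure
// (for example when the request is below PTHREAD_STACK_MIN).
std::size_t start_thread_with_stack_size(std::size_t requested) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    std::size_t actual = 0;
    if (pthread_attr_setstacksize(&attr, requested) == 0) {
        pthread_t tid;
        if (pthread_create(&tid, &attr, noop, nullptr) == 0) {
            pthread_join(tid, nullptr);
            pthread_attr_getstacksize(&attr, &actual);
        }
    }
    pthread_attr_destroy(&attr);
    return actual;
}
```

Here the implementation allocates and releases the stack itself; the caller only states how large it should be.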
Is there any way to tune the heapsize as well, process level and individual thread level?
Using the Solaris compiler you can try to change the pagesize using the -xpagesize option, but you cannot adjust the size of the heap (it will be as large as the memory available to the machine). There is only one heap shared by all threads, so you cannot adjust it per-thread.
If new operator fails because of underlying memory leakage, how do I set new handler to take care of bad allocations?
The new handler is a specialized feature and there is no general answer; how to use a new handler depends heavily on the details of your program. It can't be used to fix memory leaks: once the memory is leaked it's too late, so you need to prevent leaks from happening in the first place. (And if you don't know how to write a new handler then you probably don't need to use one.)
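Purely as an illustration of the mechanics, one possible new handler keeps an emergency reserve and gives it back when an allocation fails. This is a sketch of one policy among many, not a recommendation; g_reserve, install_reserve and release_reserve_handler are names invented for this example.

```cpp
#include <cstddef>
#include <new>

// Emergency reserve that the handler can give back to the allocator.
static char* g_reserve = nullptr;

// Called by operator new when an allocation fails. It must either make
// memory available, install another handler, throw, or terminate.
void release_reserve_handler() {
    if (g_reserve != nullptr) {
        delete[] g_reserve;   // free the reserve so operator new can retry
        g_reserve = nullptr;
        return;
    }
    // Nothing left to release: remove the handler so that the next
    // failed attempt throws std::bad_alloc.
    std::set_new_handler(nullptr);
}

// Allocates the reserve up front and installs the handler.
void install_reserve(std::size_t bytes) {
    g_reserve = new char[bytes];
    std::set_new_handler(release_reserve_handler);
}
```

As the answer says, this does nothing about leaks; it only buys a little headroom to shut down or shed load cleanly.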
I have a hypothesis here, but it's a little tough to verify.
Is there a unique stack frame for each calling thread when two threads invoke the same method of the same object instance? In a compiled binary, I understand a class to be a static code section filled with function definitions in memory and the only difference between different objects is the this pointer which is passed beneath the hood.
But therefore the thread calling it must have its own stack frame, or else two threads trying to access the same member function of the same object instance, would be corrupting one another's local variables.
Just to reiterate here, I'm not referring to whether or not two threads can corrupt the objects data by both modifying this at the same time, I'm well aware of that. I'm more getting at whether or not, in the case that two threads enter the same method of the same instance at the same time, whether or not the local variables of that context are the same places in memory. Again, my assumption is that they are not.
You are correct. Each thread makes use of its own stack and each stack makes local variables distinct between threads.
This is not specific to C++ though. It's just the way processors function. (That is, on modern processors. Some older processors had only one stack, like the 6502, which had only 256 bytes of stack and no real capability to run threads...)
Objects may be on the stack and shared between threads and thus you can end up modifying the same object on another thread stack. But that's only if you share that specific pointer.
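A small experiment along these lines can make the point concrete. It is only a sketch: the Counter class and locals_are_distinct are made up for this example, and both threads are held inside the method at the same time so that their stack frames are alive simultaneously.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

struct Counter {
    std::atomic<int> arrived{0};

    // Returns the address of this call's local variable. The call waits
    // until n_threads callers are inside the method at the same time.
    std::uintptr_t work(int n_threads) {
        int local = 0;
        const auto addr = reinterpret_cast<std::uintptr_t>(&local);
        arrived.fetch_add(1);
        while (arrived.load() < n_threads) std::this_thread::yield();
        return addr;
    }
};

// Two threads call the same method on the same instance; each one sees
// its own copy of `local`, at a different address.
bool locals_are_distinct() {
    Counter shared;
    std::uintptr_t a = 0, b = 0;
    std::thread t1([&] { a = shared.work(2); });
    std::thread t2([&] { b = shared.work(2); });
    t1.join();
    t2.join();
    return a != b;
}
```

The member arrived, by contrast, is part of the object and is shared by both threads, which is why it has to be atomic.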
You are right that different threads have unique stacks. That is not a feature of C++ but something provided by the OS. Class objects won't necessarily be different; that depends on how they are allocated. Different threads could share heap objects, which might lead to concurrency problems.
Local variables of any function or class method are stored on the calling thread's own stack (more precisely, in a stack frame on that thread's stack), so it doesn't matter which thread you call the method from: each call uses its own stack frame during execution.
A slightly different explanation: each method call creates its own stack frame.
NOTE: static variables will be the same.
Of course there exist techniques to get access to another method's stack memory during execution, but those are kind of hacks.
I am planning to write a C++ networked application where:
I use a single thread to accept TCP connections and also to read data from them. I am planning to use epoll/select to do this. The data is written into buffers that are allocated using some arena allocator say jemalloc.
Once there is enough data from a single TCP client to form a protocol message, the data is published on a ring buffer. The ring buffer structures contain the fd for the connection and a pointer to the buffer containing the relevant data.
A worker thread processes entries from the ring buffers and sends some result data to the client. After processing each event, the worker thread frees the actual data buffer to return it to the arena allocator for reuse.
I am leaving out details on how the publisher makes data written by it visible to the worker thread.
So my question is: Are there any allocators which optimize for this kind of behavior i.e. allocating objects on one thread and freeing on another?
I am worried specifically about having to use locks to return memory to an arena which is not the thread affinitized arena. I am also worried about false sharing since the producer thread and the worker thread will both write to the same region. Seems like jemalloc or tcmalloc both don't optimize for this.
Before you go down the path of implementing a highly optimized allocator for your multi-threaded application, you should first just use the standard new and delete operators for your implementation. After you have a correct implementation of your application, you can move to address bottlenecks that are discovered through profiling it.
If you get to the stage where it is obvious that the standard new and delete allocators are a bottleneck to the application, the following is the approach I have used:
Assumption: The number of threads is fixed and the threads are statically created.
Each thread has their own arena.
Each object taken from an arena has a reference back to the arena it came from.
Each arena has a separate garbage list for each thread.
When a thread frees an object, it goes back to the arena it came from, but is placed in the thread specific garbage list.
The thread that actually owns the arena treats its garbage list as the real free list.
Periodically, the thread that owns an arena performs a garbage collection pass to fold objects from the other thread garbage lists into the real free list.
The "periodical" garbage collection pass doesn't necessarily have to be time based. A subset of the garbage could be reaped on every allocation and free, for example.
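The scheme above could be sketched roughly like this. It is a simplified illustration under several assumptions that are not in the answer: fixed-size nodes, a thread count fixed at kThreads, thread ids passed explicitly, garbage lists implemented as atomic singly linked lists, and collection triggered only when the free list runs dry. All names (Arena, Node, release) are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

constexpr int kThreads = 2;  // assumption: fixed number of threads

struct Arena;

struct Node {
    Arena* owner;      // reference back to the arena the node came from
    Node* next;
    char payload[64];  // assumption: fixed-size objects
};

struct Arena {
    int owner_id;
    std::vector<Node> storage;
    Node* free_list = nullptr;             // touched by the owner thread only
    std::atomic<Node*> garbage[kThreads];  // one garbage list per freeing thread

    Arena(int id, std::size_t count) : owner_id(id), storage(count) {
        for (auto& g : garbage) g.store(nullptr);
        for (auto& node : storage) {
            node.owner = this;
            node.next = free_list;
            free_list = &node;
        }
    }

    // Owner thread only: fold the other threads' garbage into the free list.
    void collect() {
        for (auto& g : garbage) {
            Node* n = g.exchange(nullptr, std::memory_order_acquire);
            while (n != nullptr) {
                Node* next = n->next;
                n->next = free_list;
                free_list = n;
                n = next;
            }
        }
    }

    // Owner thread only: returns nullptr when the arena is exhausted.
    Node* allocate() {
        if (free_list == nullptr) collect();
        Node* n = free_list;
        if (n != nullptr) free_list = n->next;
        return n;
    }
};

// Any thread: return a node to the arena it came from.
void release(Node* n, int thread_id) {
    Arena* arena = n->owner;
    if (thread_id == arena->owner_id) {  // the owner's list is the real free list
        n->next = arena->free_list;
        arena->free_list = n;
        return;
    }
    std::atomic<Node*>& head = arena->garbage[thread_id];
    n->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(n->next, n, std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}
```

Because the owner takes a whole garbage list with a single exchange, the freeing thread never takes a lock and the owner never blocks.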
The best way to deal with memory allocation and deallocation issues is to not deal with it.
You mention a ring buffer. Those are usually a fixed size. If you can come up with a fixed maximum size for your protocol messages you can allocate all the memory you will ever need at program start. When deallocating, keep the memory but reset it to a fresh state.
Now, your program may need to allocate and deallocate memory while dealing with each message but that will be done in each thread and cross-thread issues will not come into play.
This can work even if your message maximum size is too large to preallocate if you can allocate the amount of memory that most messages will use and have handlers for allocating more when necessary.
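A rough sketch of the preallocation idea, assuming a hypothetical maximum message size kMaxMessage and slot count kSlots. It deliberately leaves out how the publisher makes entries visible to the worker thread (as the question does), so as written it is only safe from a single thread.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

constexpr std::size_t kMaxMessage = 256;  // assumption: protocol maximum
constexpr std::size_t kSlots = 4;         // assumption: ring capacity

struct Slot {
    int fd = -1;
    std::size_t len = 0;
    char data[kMaxMessage];
};

// All message storage is allocated once, when the ring is constructed.
class MessageRing {
    std::array<Slot, kSlots> slots_{};
    std::size_t head_ = 0;  // next slot to write
    std::size_t tail_ = 0;  // next slot to read

public:
    // Copies a message into the next free slot; false if full or too large.
    bool publish(int fd, const char* bytes, std::size_t len) {
        if (len > kMaxMessage || head_ - tail_ == kSlots) return false;
        Slot& slot = slots_[head_ % kSlots];
        slot.fd = fd;
        slot.len = len;
        std::memcpy(slot.data, bytes, len);
        ++head_;
        return true;
    }

    // Oldest unconsumed message, or nullptr if the ring is empty.
    const Slot* front() const {
        return head_ == tail_ ? nullptr : &slots_[tail_ % kSlots];
    }

    // "Deallocation": keep the memory but reset the slot to a fresh state.
    void consume() {
        if (head_ == tail_) return;
        Slot& slot = slots_[tail_ % kSlots];
        slot.fd = -1;
        slot.len = 0;
        ++tail_;
    }
};
```

No allocator is involved after construction, so there is nothing to hand back across threads.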
I want to increase the stack size of a thread created through pthread_create(). The way to go seems to be
int pthread_attr_setstack( pthread_attr_t *attr,
void *stackaddr,
size_t stacksize );
from pthread.h.
However, according to multiple online references,
The stackaddr shall be aligned appropriately to be used as a stack; for example, pthread_attr_setstack() may fail with [EINVAL] if ( stackaddr & 0x7) is not 0.
My question: could someone provide an example of how to perform the alignment? Is it (the alignment) platform or implementation dependent?
Thanks in advance
Never use pthread_attr_setstack. It has a lot of fatal flaws, the worst of which is that it is impossible to ever free or reuse the stack after a thread has been created using it. (POSIX explicitly states that any attempt to do so results in undefined behavior.)
POSIX provides a much better function, pthread_attr_setstacksize which allows you to request the stack size you need, but leaves the implementation with the responsibility for allocating and deallocating the stack.
Look into posix_memalign().
It will allocate a memory block of the requested alignment and size.
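As a sketch of how that might look: the first helper rounds an address up by hand, and the second lets posix_memalign() do the alignment. Page alignment is an assumption made here because it comfortably satisfies the 8-byte example from the quoted text; the names align_up and make_aligned_stack_attr are hypothetical.

```cpp
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Manual alignment: round addr up to the next multiple of `alignment`,
// which must be a power of two.
std::uintptr_t align_up(std::uintptr_t addr, std::uintptr_t alignment) {
    return (addr + alignment - 1) & ~(alignment - 1);
}

// Allocates a page-aligned buffer and checks that pthread_attr_setstack()
// accepts it. Returns true if the buffer is aligned and the call succeeds.
bool make_aligned_stack_attr(std::size_t stack_size) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* stack = nullptr;
    if (posix_memalign(&stack, page, stack_size) != 0) return false;

    const bool aligned = reinterpret_cast<std::uintptr_t>(stack) % page == 0;

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    const int rc = pthread_attr_setstack(&attr, stack, stack_size);
    pthread_attr_destroy(&attr);

    // No thread was created with this buffer, so it can be freed here.
    free(stack);
    return aligned && rc == 0;
}
```

The required alignment is implementation dependent, which is why the sketch over-aligns to a page rather than guessing the minimum.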
I am using malloc_stats() to print malloc related statistics in which I am finding "Arena 0" for some programs and "Arena 0 and Arena 1" for some other programs.
What do these arenas represent?
The heap code resides inside the glibc component, and is packaged in the libc.so.x shared library. The current implementation of the heap uses multiple independent sub-heaps called arenas. Each arena has its own mutex for concurrency protection. Thus if there are sufficient arenas within a process' heap, and a mechanism to distribute the threads' heap accesses evenly between them, then the potential for contention for the mutexes should be minimal. It turns out that this works well for allocations. In malloc(), a test is made to see if the mutex for current target arena for the current thread is free (trylock). If so then the arena is now locked and the allocation proceeds. If the mutex is busy then each remaining arena is tried in turn and used if the mutex is not busy. In the event that no arena can be locked without blocking, a fresh new arena is created. This arena by definition is not already locked, so the allocation can now proceed without blocking. Lastly, the ID of the arena last used by a thread is retained in thread local storage, and subsequently used as the first arena to try when malloc() is next called by that thread. Therefore all calls to malloc() will proceed without blocking.
It looks like the heap is a collection of arenas ("sub-heaps") used to handle memory allocation between several threads, thus reducing contention.
In certain malloc implementations, an "arena" is a pool of memory from which individual allocations are made. The algorithms to determine which arena is used will differ between implementations, so it's not possible for us to explain why you see a difference. One common factor is allocation size.
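One way to observe this with glibc is to allocate from a second thread before calling malloc_stats(). The sketch below assumes glibc (malloc_stats() is a glibc extension that prints to stderr); demo_arena_stats is a name made up for this example, and whether a second arena actually shows up depends on the implementation and its tuning.

```cpp
#include <malloc.h>  // malloc_stats(), a glibc extension
#include <cstdlib>
#include <thread>

// Allocates once from a worker thread, then prints the allocator statistics.
// Returns true if the worker's allocation succeeded.
bool demo_arena_stats() {
    bool ok = false;
    std::thread worker([&ok] {
        // volatile keeps the compiler from optimizing the allocation away.
        void* volatile p = std::malloc(4096);
        ok = (p != nullptr);
        std::free(p);
    });
    worker.join();
    malloc_stats();  // writes per-arena statistics to stderr
    return ok;
}
```

A single-threaded program would typically report only one arena, which may explain the difference between the programs in the question.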
Everything is there: http://www.gnu.org/software/libc/manual/html_node/Statistics-of-Malloc.html
int arena
This is the total size of memory allocated with sbrk by malloc, in bytes.