Good allocator for cross-thread allocation and free - C++

I am planning to write a C++ networked application where:
I use a single thread to accept TCP connections and also to read data from them. I am planning to use epoll/select to do this. The data is written into buffers that are allocated using some arena allocator, say jemalloc.
Once there is enough data from a single TCP client to form a protocol message, the data is published on a ring buffer. The ring buffer structures contain the fd for the connection and a pointer to the buffer containing the relevant data.
A worker thread processes entries from the ring buffers and sends some result data to the client. After processing each event, the worker thread frees the actual data buffer to return it to the arena allocator for reuse.
I am leaving out details on how the publisher makes data written by it visible to the worker thread.
So my question is: Are there any allocators which optimize for this kind of behavior i.e. allocating objects on one thread and freeing on another?
I am worried specifically about having to use locks to return memory to an arena which is not the thread-affinitized arena. I am also worried about false sharing, since the producer thread and the worker thread will both write to the same region. It seems like neither jemalloc nor tcmalloc optimizes for this.

Before you go down the path of implementing a highly optimized allocator for your multi-threaded application, you should first just use the standard new and delete operators. Once you have a correct implementation of your application, you can address the bottlenecks that profiling actually reveals.
If you get to the stage where it is obvious that the standard new and delete allocators are a bottleneck to the application, the following is the approach I have used:
Assumption: The number of threads is fixed and they are created statically.
Each thread has its own arena.
Each object taken from an arena has a reference back to the arena it came from.
Each arena has a separate garbage list for each thread.
When a thread frees an object, it goes back to the arena it came from, but it is placed on the freeing thread's specific garbage list.
The thread that actually owns the arena treats its garbage list as the real free list.
Periodically, the thread that owns an arena performs a garbage-collection pass to fold objects from the other threads' garbage lists into the real free list.
The "periodic" garbage-collection pass doesn't necessarily have to be time based. A subset of the garbage could be reaped on every allocation and free, for example.

The best way to deal with memory allocation and deallocation issues is to not deal with them.
You mention a ring buffer. Those are usually a fixed size. If you can come up with a fixed maximum size for your protocol messages you can allocate all the memory you will ever need at program start. When deallocating, keep the memory but reset it to a fresh state.
Now, your program may need to allocate and deallocate memory while dealing with each message, but that will happen within a single thread, so cross-thread issues will not come into play.
This can work even if your maximum message size is too large to preallocate for, as long as you can allocate the amount of memory that most messages will use and have handlers for allocating more when necessary.
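As a concrete illustration, here is a minimal sketch of that idea applied to the original ring-buffer design, assuming a fixed maximum message size and ring capacity (both constants below are made up). The payload lives inside the slot itself, so handing a slot back to the producer is the "deallocation" and no allocator runs after startup.

#include <array>
#include <cstddef>

constexpr std::size_t kMaxMessageSize = 64 * 1024;   // assumed protocol maximum
constexpr std::size_t kRingSize       = 1024;        // assumed ring capacity

struct Slot {
    int fd = -1;                                     // connection the message came from
    std::size_t length = 0;                          // bytes of payload actually used
    std::array<unsigned char, kMaxMessageSize> payload;
};

struct Ring {
    std::array<Slot, kRingSize> slots;               // all memory allocated once, at program start

    void reset(Slot& s) {                            // "deallocate": return the slot to a fresh state
        s.fd = -1;
        s.length = 0;
    }
};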

Related

Sharing heap memory with fork()

I am working on implementing a database server in C that will handle requests from multiple clients. I am using fork() to handle connections for individual clients.
The server stores data in the heap, which consists of a root pointer to hash tables of dynamically allocated records. The records are structs that have pointers to various data-types. I would like for the processes to be able to share this data so that, when a client makes a change to the heap, the changes will be visible for the other clients.
I have learned that fork() uses COW (Copy On Write), and my understanding is that it copies the heap (and stack) memory of the parent process when the child tries to modify the data in memory.
I have found out that I can use the shm library to share memory.
Would the code below be a valid way to share heap memory (in shared_string)? If a child were to use similar code (i.e. starting from //start), would other children be able to read/write to it while the child is running and after it's dead?
key_t key;
int shmid;
key = ftok("/tmp",'R');
shmid = shmget(key, 1024, 0644 | IPC_CREAT);
//start
char * string;
string = malloc(sizeof(char) * 10);
strcpy(string, "a string");
char * shared_string;
shared_string = shmat(shmid, string, 0);
strcpy(shared_string, string);
Here are some of my thoughts/concerns regarding this:
I'm thinking about sharing the root pointer of the database. I'm not sure if that would work or if I have to mark all allocated memory as shared.
I'm not sure if the parent / other children are able to access memory allocated by a child.
I'm not sure if a child's allocated memory stays on the heap after it is killed, or if that memory is released.
First of all, fork is completely inappropriate for what you're trying to achieve. Even if you can make it work, it's a horrible hack. In general, fork only works for very simplistic programs anyway, and I would go so far as to say that fork should never be used except when followed quickly by exec, but that's beside the point here. You really should be using threads.
With that said, the only way to have memory that's shared between the parent and child after fork, and where the same pointers are valid in both, is to mmap (or shmat, but that's a lot fuglier) a file or anonymous map with MAP_SHARED prior to the fork. You cannot create new shared memory like this after fork because there's no guarantee that it will get mapped at the same address range in both.
Just don't use fork. It's not the right tool for the job.
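To make the mmap route concrete, here is a minimal sketch (Linux-flavoured; some systems spell the flag MAP_ANON) that creates a MAP_SHARED anonymous mapping before fork, so parent and child see the same pages at the same address:

#include <cstdio>
#include <cstring>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    const std::size_t kRegionSize = 4096;            // size is arbitrary for the example
    void* mem = mmap(nullptr, kRegionSize, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) { std::perror("mmap"); return 1; }
    char* shared = static_cast<char*>(mem);

    pid_t pid = fork();
    if (pid == 0) {                                  // child: writes land in the shared pages
        std::strcpy(shared, "written by the child");
        _exit(0);
    }
    waitpid(pid, nullptr, 0);
    std::printf("parent sees: %s\n", shared);        // the child's write is visible here
    munmap(shared, kRegionSize);
    return 0;
}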
I think you are basically looking to do what is done by Redis (and probably others).
They describe it in http://redis.io/topics/persistence (search for "copy-on-write").
threads defeat the purpose
classic shared memory (shm, mapped memory) also defeats the purpose
The primary benefit to using this method is avoidance of locking, which can be a pain to get right.
As far as I understand it the idea of using COW is to:
fork when you want to write, not in advance
the child (re)writes the data to disk, then immediately exits
the parent keeps on doing its work, and detects (SIGCHLD) when the child exited.
If, while doing its work, the parent ends up making changes to the hash, the kernel will copy the affected pages for the parent before it modifies them.
A "dirty flag" is used to track whether a new fork is needed to execute a new write.
Things to watch out for:
Make sure there is only one outstanding child at a time.
Transactional safety: write to a temp file first, then move it over so that you always have a complete copy, maybe keeping the previous around if the move is not atomic.
Test whether you will have issues with other resources that get duplicated (file descriptors, global destructors in C++).
You may want to take a gander at the Redis code as well.
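To make the sequence concrete, here is a rough sketch of that loop. The data set and the save_to serialization are trivial stand-ins, the names are made up, and real code would handle signals and errors more carefully:

#include <atomic>
#include <csignal>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

static std::atomic<bool> dirty{true};            // set whenever the parent mutates the data
static std::atomic<bool> child_running{false};   // "only one outstanding child"

void save_to(const char* path) {                 // stand-in: child serializes its frozen COW view
    std::FILE* f = std::fopen(path, "w");
    if (f) { std::fputs("snapshot data\n", f); std::fclose(f); }
}

void on_sigchld(int) {                           // parent detects that the child exited
    while (waitpid(-1, nullptr, WNOHANG) > 0) {}
    child_running = false;
}

void maybe_snapshot() {
    if (!dirty || child_running) return;         // dirty flag decides whether a new fork is needed
    dirty = false;
    child_running = true;
    pid_t pid = fork();                          // fork when you want to write, not in advance
    if (pid == 0) {
        save_to("snapshot.tmp");                 // write to a temp file first...
        std::rename("snapshot.tmp", "snapshot.db");  // ...then move it over so a complete copy always exists
        _exit(0);                                // child exits immediately after writing
    }
    if (pid < 0) { dirty = true; child_running = false; }  // fork failed; try again later
}

int main() {
    std::signal(SIGCHLD, on_sigchld);
    maybe_snapshot();                            // the parent would keep doing its own work here
    while (child_running) usleep(1000);          // in this demo: just wait for the child to finish
    return 0;
}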
I'm thinking about sharing the root pointer of the database. I'm not sure if that would work or if I have to mark all allocated memory as shared.
Each process will have its own private memory range. Copy-on-write is a kernel-space optimization that is transparent to user space.
As others have said, SHM or mmap'd files are the only way to share memory between separate processes.
If you must use fork, shared memory seems to be the only choice.
Actually, I think threads are more suitable for your scenario.
If you don't want to be multi-threaded, here is another choice: use a single-process, single-thread model, like Redis.
With this model you don't need to worry about things like locks, and if you want to scale, just design a routing policy, for example routing by the hash value of the key.
As you have discovered, if you want to share memory between separate processes (from fork or otherwise), you need to use shared memory, either the SYSV shm library or mmap with MAP_SHARED. Unfortunately, these are coarse-grained tools, suitable only for dealing with a small number of large blocks, and not suitable for fine-grained memory management as you would do with malloc/free.
In order to have useful shared memory between processes, you need to build a heap on top of shm or mmap. You can do that with my small shm_malloc library, which allows you to use calls to shm_malloc and shm_free exactly as you would use malloc/free.

Memory management in a lock free queue

We've been looking to use a lock-free queue in our code to reduce lock contention between a single producer and consumer in our current implementation. There are plenty of queue implementations out there, but I haven't been too clear on how best to handle memory management of the nodes.
For example, the producer looks like this:
queue.Add( new WorkUnit(...) );
And the consumer looks like:
WorkUnit* unit = queue.RemoveFront();
unit->Execute();
delete unit;
We currently use a memory pool for allocation. You'll notice that the producer allocates the memory and the consumer deletes it. Since we're using pools, we need to add another lock to the memory pool to properly protect it. This seems to negate the performance advantage of a lock-free queue in the first place.
So far, I think our options are:
Implement a lock-free memory pool.
Dump the memory pool and rely on a threadsafe allocator.
Are there any other options that we can explore? We're trying to avoid implementing a lock-free memory pool, but we may have to take that path.
Thanks.
Only the producer should be able to create objects and destroy them when they aren't needed anymore. The consumer only uses objects and marks them as used. That's the point: you don't need to share memory ownership in this case. This is the only way I know of to implement an efficient lock-free queue.
Read this great article, which describes such an algorithm in detail.
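For illustration, here is a minimal sketch of that ownership split for a single-producer/single-consumer ring. The WorkUnit payload and the names are made up; the point is that the consumer never deletes anything, it just marks the slot as used so the producer can reuse it.

#include <atomic>
#include <cstddef>

struct WorkUnit { int payload = 0; };

template <std::size_t N>
class SlotRing {
public:
    bool push(const WorkUnit& w) {                   // producer only
        Slot& s = slots_[write_ % N];
        if (s.full.load(std::memory_order_acquire)) return false;   // ring is full
        s.unit = w;
        s.full.store(true, std::memory_order_release);
        ++write_;
        return true;
    }
    bool pop(WorkUnit& out) {                        // consumer only
        Slot& s = slots_[read_ % N];
        if (!s.full.load(std::memory_order_acquire)) return false;  // nothing ready yet
        out = s.unit;
        s.full.store(false, std::memory_order_release);             // "mark as used", no delete
        ++read_;
        return true;
    }
private:
    struct Slot {
        std::atomic<bool> full{false};
        WorkUnit unit;
    };
    Slot slots_[N];
    std::size_t write_ = 0;                          // touched only by the producer
    std::size_t read_ = 0;                           // touched only by the consumer
};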
One more possibility is to have a separate memory pool for each thread, so only one thread is ever using a heap, and you can avoid locking for the allocation.
That leaves you with managing a lock-free way to free a block. Fortunately, that's pretty easy to manage: you maintain a linked list of free blocks for each heap. The freed block's own memory serves as the link field of the list node, and you then do an atomic exchange with the pointer holding the head of the linked list.
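A rough sketch of that free path, with illustrative names, is below. The answer describes a plain atomic exchange; the push here uses the more common compare-and-swap loop so that setting the link field and publishing the new head stay consistent.

#include <atomic>

struct FreeBlock { FreeBlock* next; };

struct ThreadHeap {
    std::atomic<FreeBlock*> remote_free{nullptr};   // blocks freed by other threads

    // Any thread: return a block to the heap it was allocated from.
    void free_remote(void* p) {
        FreeBlock* b = static_cast<FreeBlock*>(p);  // reuse the block itself as the list node
        FreeBlock* head = remote_free.load(std::memory_order_relaxed);
        do { b->next = head; }
        while (!remote_free.compare_exchange_weak(head, b,
                    std::memory_order_release, std::memory_order_relaxed));
    }

    // Owning thread only: grab the whole remote list and recycle it locally.
    FreeBlock* drain_remote() {
        return remote_free.exchange(nullptr, std::memory_order_acquire);
    }
};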
You should look at Intel's TBB. I don't know how much, if anything, it costs for commercial projects, but they already have concurrent queues, concurrent memory allocators, that sort of thing.
Your queue interface also looks seriously dodgy. For example, what does your RemoveFront() call do if the queue is empty? The new and delete calls also look quite redundant. Intel's TBB and Microsoft's PPL (included in VS2010) do not suffer from these issues.
I am not sure exactly what your requirements are, but if you have a scenario where the producer creates data and pushes it onto a queue, and a single consumer takes this data, uses it, and then destroys it, you just need a thread-safe queue, or you can make your own singly linked list that is thread-safe in this scenario:
Create a list element.
Append the data to the list element.
Change the next pointer of the last element from null to the new list element (or the equivalent in other languages).
The consumer can be implemented in any way, and most linked lists are by default thread-safe for this kind of operation (but check that with the implementation).
In this scenario, the consumer should free the memory, or it can return it to the producer in the same way by setting a predefined flag. The producer does not need to check all the flags (if there are more than, say, 1000 of them) to find which bucket is free; the flags can be organized as a tree, enabling log(n) searching for an available slot. I am sure this can be done in O(1) time without locking, but I have no idea how.
I had the exact same concern, so I wrote my own lock-free queue (single-producer, single-consumer) that manages the memory allocation for you (it allocates a series of contiguous blocks, sort of like std::vector).
I recently released the code on GitHub. (Also, I posted on my blog about it.)
If you create the node on the stack, enqueue it, then dequeue it to another node on the stack, you won't need to use pointers/manual allocation at all. Additionally, if your node implements move semantics for construction and assignment, it will be automatically moved instead of copied :-)

Will the memory in the heap be released when I do pthread_cancel on iOS?

Let's say I have allocated some memory in my background thread; that is, the thread's stack holds a pointer to that memory. Now I want to terminate the background thread's execution by calling pthread_cancel on it. Will that memory be released or not? (My platform is iOS, the compiler is gcc 4.2.)
Each thread by necessity requires its own stack; however there is typically only one heap per process. When a thread is destroyed, there is no automatic mechanism to free the memory allocated on the heap. All you end up with is a memory leak.
As a general rule, avoid using pthread_cancel since it is hard to ensure that pthread_cancel will run safely. Rather build in some mechanism where you can pass a message to the thread to destroy itself (after freeing any resources that it owns).
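A minimal sketch of that message-based shutdown is below, using std::atomic for the flag and a pthread worker; the names are made up, and an older toolchain such as the asker's gcc 4.2 would need an equivalent flag primitive instead of std::atomic.

#include <atomic>
#include <cstdlib>
#include <pthread.h>

std::atomic<bool> stop_requested{false};

void* worker(void*) {
    void* scratch = std::malloc(4096);       // heap memory owned by this thread
    while (!stop_requested.load()) {
        // ... do a bounded chunk of work, then re-check the flag ...
    }
    std::free(scratch);                      // the thread frees its own resources before exiting
    return nullptr;
}

int main() {
    pthread_t t;
    pthread_create(&t, nullptr, worker, nullptr);
    stop_requested.store(true);              // ask the thread to finish
    pthread_join(t, nullptr);                // instead of pthread_cancel
    return 0;
}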
The thread stack will be removed once the thread exits. But no process or code will look into your thread stack and release references to the objects you've allocated on the heap. Also, thread stacks don't typically hold long-lived references to such memory: the stack is an independent space given to the thread for the generic program stack, and a reference stays on it only for as long as you are inside the function that pushed it there, typically because you are referencing the memory through a local variable.
By default, no; see the other answers, which are more specific to what you are asking. There is, however, such a thing as a thread-specific allocator; if you were using one, you'd know.
No, it won't be deleted or freed automatically. If you're very lucky, it might be garbage collected sometime if you're running a collector. File handles, shared memory ids, mutexes etc. won't be released/deallocated either. Async cancellation is safe for e.g. pure maths calculations on data still owned by another thread, but very risky in general - that's why some threading APIs have experimented with and removed the function completely.

Non-blocking TCP buffer issues

I think I have a problem. I have two TCP apps connected to each other which use Winsock I/O completion ports to send/receive data (non-blocking sockets).
Everything works just fine until there's a data transfer burst. The sender starts sending incorrect/malformed data.
I allocate the buffers I'm sending on the stack, and if I understand correctly, that's the wrong thing to do, because these buffers should remain as I sent them until I get the "write complete" notification from IOCP.
Take this for example:
void some_function()
{
char cBuff[1024];
// filling cBuff with some data
WSASend(...); // sending cBuff, non-blocking mode
// filling cBuff with other data
WSASend(...); // again, sending cBuff
// ..... and so forth!
}
If I understand correctly, each of these WSASend() calls should have its own unique buffer, and that buffer can be reused only when the send completes.
Correct?
Now, what strategies can I implement in order to maintain a big sack of such buffers, how should I handle them, how can I avoid performance penalty, etc'?
And if I am to use such buffers, that means I should copy the data to be sent from the source buffer into the temporary one; in that case I'd set SO_SNDBUF on each socket to zero, so the system will not re-copy what I have already copied. Are you with me? Please let me know if I wasn't clear.
Take a serious look at boost::asio. Asynchronous IO is its specialty (just as the name suggests). It's a pretty mature library by now, having been in Boost since 1.35. Many people use it in production for very intensive networking. There's a wealth of examples in the documentation.
One thing is for sure: it takes working with buffers very seriously.
Edit:
The basic idea for handling bursts of input is queuing.
Create, say, three linked lists of pre-allocated buffers - one is for free buffers, one for to-be-processed (received) data, one for to-be-sent data.
Every time you need to send something - take a buffer off the free list (allocate a new one if free list is empty), fill with data, put it onto to-be-sent list.
Every time you need to receive something - take a buffer off the free list as above, give it to IO receive routine.
Periodically take buffers off to-be-sent queue, hand them off to send routine.
On send completion (inline or asynchronous) - put them back onto free list.
On receive completion - put buffer onto to-be-processed list.
Have your "business" routine take buffers off to-be-processed list.
The bursts will then fill that input queue until you are able to process them. You might want to limit the queue size to avoid blowing through all the memory.
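For illustration, a stripped-down version of the free-list part might look like the sketch below. The sizes and names are made up, and the mutex is only needed if buffers are acquired and released on different threads; if everything happens on the IO thread it can go.

#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

constexpr std::size_t kBufferSize = 4096;
constexpr std::size_t kInitialBuffers = 64;

struct Buffer {
    std::size_t used = 0;
    unsigned char data[kBufferSize];
};

class BufferPool {
public:
    BufferPool() {                                  // pre-allocate the free list up front
        for (std::size_t i = 0; i < kInitialBuffers; ++i)
            free_.push_back(std::make_unique<Buffer>());
    }
    std::unique_ptr<Buffer> acquire() {             // take a buffer off the free list,
        std::lock_guard<std::mutex> lock(mutex_);   // allocating a new one if it is empty
        if (free_.empty()) return std::make_unique<Buffer>();
        auto b = std::move(free_.back());
        free_.pop_back();
        b->used = 0;
        return b;
    }
    void release(std::unique_ptr<Buffer> b) {       // on send/receive completion
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(std::move(b));
    }
private:
    std::mutex mutex_;
    std::vector<std::unique_ptr<Buffer>> free_;
};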
I don't think it is a good idea to do a second send before the first send is finished.
Similarly, I don't think it is a good idea to change the buffer before the send is finished.
I would be inclined to store the data in some sort of queue. One thread can keep adding data to the queue. The second thread can work in a loop. Do a send and wait for it to finish. If there is more data do another send, else wait for more data.
You would need a critical section (or some such) to nicely share the queue between the threads and possibly an event or a semaphore for the sending thread to wait on if there is no data ready.
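A minimal sketch of that arrangement, using std::mutex and std::condition_variable in place of a Win32 critical section and event, is below; the Message type and the send_blocking stand-in are made up.

#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

struct Message { std::string bytes; };

void send_blocking(const Message&) { /* stand-in: issue the send and wait for its completion */ }

class SendQueue {
public:
    void push(Message m) {                          // producer thread: keep adding data
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push_back(std::move(m)); }
        ready_.notify_one();
    }
    void run_sender() {                             // sender thread: one send at a time
        for (;;) {
            std::unique_lock<std::mutex> lock(mutex_);
            ready_.wait(lock, [this] { return !queue_.empty(); });
            Message m = std::move(queue_.front());
            queue_.pop_front();
            lock.unlock();
            send_blocking(m);                       // the buffer stays untouched until the send finishes
        }
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Message> queue_;
};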
Now, what strategies can I implement in order to maintain a big sack of such buffers, how should I handle them, how can I avoid performance penalty, etc'?
It's difficult to know the answer without knowing more about your specific design. In general I'd avoid maintaining your own "sack of buffers" and instead use the OS's built-in sack of buffers: the heap.
But in any case, what I would do in the general case is expose an interface to the callers of your code that mirrors what WSASend does for overlapped I/O. For example, suppose you are providing an interface to send a specific struct:
struct Foo
{
int x;
int y;
};
// foo will be consumed by SendFoo, and deallocated, don't use it after this call
void SendFoo(Foo* foo);
I would require users of SendFoo to allocate a Foo instance with new, and tell them that after calling SendFoo the memory is no longer "owned" by their code and they therefore shouldn't use it.
You can enforce this even further with a little trickery:
// After this operation the resultant foo ptr will no longer point to
// memory passed to SendFoo
void SendFoo(Foo*& foo);
This allows the body of SendFoo to send the address of the memory down to WSASend, but modify the passed-in pointer to NULL, severing the link between the caller's code and their memory. Of course, you can't really know what the caller is doing with that address; they may have a copy elsewhere.
This interface also enforces that a single block of memory will be used with each WSASend. You are really treading into more than dangerous territory trying to share one buffer between two WSASend calls.

Arena in Malloc Function

I am using malloc_stats() to print malloc-related statistics, and I am finding "Arena 0" for some programs and both "Arena 0" and "Arena 1" for other programs.
What do these arenas represent?
The heap code resides inside the glibc component, and is packaged in the libc.so.x shared library. The current implementation of the heap uses multiple independent sub-heaps called arenas. Each arena has its own mutex for concurrency protection. Thus, if there are sufficient arenas within a process' heap, and a mechanism to distribute the threads' heap accesses evenly between them, then the potential for contention for the mutexes should be minimal. It turns out that this works well for allocations.
In malloc(), a test is made to see if the mutex for the current target arena for the current thread is free (trylock). If so, then the arena is now locked and the allocation proceeds. If the mutex is busy, then each remaining arena is tried in turn and used if its mutex is not busy. In the event that no arena can be locked without blocking, a fresh new arena is created. This arena by definition is not already locked, so the allocation can now proceed without blocking.
Lastly, the ID of the arena last used by a thread is retained in thread local storage, and subsequently used as the first arena to try when malloc() is next called by that thread. Therefore all calls to malloc() will proceed without blocking.
It looks like the heap is a collection of arenas ("sub-heaps") used to handle memory allocation between several threads, thus reducing contention.
In certain malloc implementations, an "arena" is a pool of memory from which individual allocations are made. The algorithms to determine which arena is used will differ between implementations, so it's not possible for us to explain why you see a difference. One common factor is allocation size.
Everything is there: http://www.gnu.org/software/libc/manual/html_node/Statistics-of-Malloc.html
int arena
This is the total size of memory allocated with sbrk by malloc, in bytes.
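As a quick experiment (glibc-specific, and the exact behaviour depends on the glibc version and on M_ARENA_MAX), allocating from a second thread is usually what makes the extra arena appear in the statistics:

#include <malloc.h>
#include <cstdlib>
#include <thread>

int main() {
    void* a = std::malloc(256);              // main thread: served from arena 0
    std::thread([] {
        void* b = std::malloc(256);          // second thread: usually gets its own arena
        std::free(b);
    }).join();
    std::free(a);
    malloc_stats();                          // typically reports both "Arena 0" and "Arena 1" here
    return 0;
}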