Is this an appropriate use for shared_ptr? - c++

Project: typical chat program. Server must receive text from multiple clients and fan each input out to all clients.
In the server I want each client to have a struct containing the socket fd and a std::queue. Each struct will live on a std::list.
As input is received from a client socket I want to iterate over the list of structs and put the new input into each client struct's queue. The string is new'ed because I don't want copies of it multiplied across all the clients. But I also want to avoid the headache of having multiple pointers to the string spread around and deciding when it is finally time to delete it.
Is this an appropriate occasion for a shared pointer? If so, is the shared_ptr's use count incremented each time I push it into a queue and decremented when I pop it from the queue?
Thanks for any help.

This is a case where a pseudo-garbage collector system will work much better than reference counting.
You need only one list of strings, because you "fan every input out to all clients". Because you will add to one end and remove from the other, a deque is an appropriate data structure.
Now, each connection needs only to keep track of the index of the last string it sent. Periodically (every 1000th message received, or every 4MB received, or something like that), you find the minimum of this index across all clients, and delete strings up to that point. This periodic check is also an opportunity to detect clients which have fallen far behind (possible broken connection) and recover. Without this check, a single stuck client will cause your program to leak memory (even under the reference counting scheme).
This scheme is several times less data than reference counting, and also removes one of the major points of cache contention (reference counts must be written from multiple threads, so they ruin performance). If you aren't using threads, it'll still be faster.
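A minimal sketch of that scheme, with illustrative names (Hub, Client and trim are not from the question): one shared deque of messages plus a per-client cursor, and a periodic trim of everything every client has already consumed.

#include <algorithm>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

struct Client {
    int fd;
    std::size_t next = 0;   // absolute index of the next message to send to this client
};

struct Hub {
    std::deque<std::string> messages;  // shared backlog, oldest at the front
    std::size_t offset = 0;            // absolute index of messages.front()
    std::vector<Client> clients;

    void post(std::string msg) { messages.push_back(std::move(msg)); }

    // Called periodically (every Nth message, say): drop everything that
    // every client has already consumed.
    void trim() {
        if (clients.empty()) return;
        std::size_t min_next = offset + messages.size();
        for (const Client& c : clients)
            min_next = std::min(min_next, c.next);
        while (offset < min_next && !messages.empty()) {
            messages.pop_front();
            ++offset;
        }
    }
};

The same periodic pass over the clients is where a cursor that lags far behind the front would flag a stuck or broken connection.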

That is an appropriate use of a shared_ptr. And yes, the use count will be incremented, because a new shared_ptr is created for each push (and decremented again when the queue entry is popped and destroyed).
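For completeness, a hedged sketch of the shared_ptr variant the question describes (ClientCtx, fan_out and on_writable are illustrative names): each client's queue holds a std::shared_ptr<const std::string>, so pushing bumps the use count, popping drops it, and the string is freed when the last queue lets go.

#include <list>
#include <memory>
#include <queue>
#include <string>

struct ClientCtx {
    int fd;
    std::queue<std::shared_ptr<const std::string>> outbox;
};

std::list<ClientCtx> clients;

void fan_out(const std::string& raw) {
    auto msg = std::make_shared<const std::string>(raw);  // one heap copy of the text
    for (ClientCtx& c : clients)
        c.outbox.push(msg);           // use count goes up by one per queue
}                                      // local 'msg' dies here; count drops by one

void on_writable(ClientCtx& c) {
    if (!c.outbox.empty()) {
        std::shared_ptr<const std::string> next = c.outbox.front();
        // ... send(*next) on c.fd ...
        c.outbox.pop();                // this queue's reference is released here
    }                                   // string deleted once the last holder lets go
}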

Related

Sending list of connected users to newly connected user in multithreaded iocp server

I need some advice on how to properly send a doubly linked list of connected users. Some basic information about my code and my approach so far:
I keep information about all connected users in a doubly linked list, which is shared among threads. I store the head of the list in a global variable, PPER_user g_usersList, and the struct for users looks like:
typedef struct _user {
    char id;
    char status;
    struct _user *pCtxtBack;
    struct _user *pCtxtForward;
} user, *PPER_user;
When a new user connects to the server, data about the connected users is gathered from the linked list and sent to him:
WSABUF wsabuf; PPER_user pTemp1, pTemp2; unsigned int c = 0;
.....
EnterCriticalSection(&g_CSuserslist);
pTemp1 = g_usersList;
while (pTemp1) {
    pTemp2 = pTemp1->pCtxtBack;
    wsabuf.buf[c++] = pTemp1->id;      // fill buffer with data about all users
    wsabuf.buf[c++] = pTemp1->status;
    pTemp1 = pTemp2;
}
WSASend(...,wsabuf,...);
LeaveCriticalSection(&g_CSuserslist);
But a few things about the code above make me confused:
The linked list is under heavy use by other threads. The more users are connected (for example 100 or 1000), the longer the list stays locked for the entire duration of gathering the data. Should I just accept that, or find a better way to do this?
It seems that while one thread holds the lock and the while loop walks through all the chained structs (users) gathering every id and status, other threads must take the same critical section (&g_CSuserslist) whenever a user wants to change his own id, status etc. But this will likely kill performance. Maybe I should change the whole design of my app, or something?
Any insight you may have would be appreciated. Thanks in advance.
The only problem I see in your code (and more generally in the description of your app) is the size of the critical section that protects g_usersList. The rule is: avoid any time-consuming operation while inside a critical section.
So you must protect:
adding a new user
removing a user at disconnection
taking a snapshot of the list for further processing
All those operations are memory-only, so unless you run under really heavy load, everything should be fine provided you keep all I/O outside of the critical sections (1), because the locking only happens when users connect or disconnect. If you move the WSASend outside the critical section, all should go fine and IMHO that is enough.
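A sketch of that shape, reusing the question's names where they exist (MAX_USERS and the socket s are assumptions, and error handling is omitted): copy the list into a local buffer inside the critical section, leave it, and only then call WSASend.

// Sketch only: requires <winsock2.h>; buffer sizing and error checks trimmed.
char snapshot[2 * MAX_USERS];          // MAX_USERS is an assumed constant
unsigned int c = 0;

EnterCriticalSection(&g_CSuserslist);
for (PPER_user p = g_usersList; p != NULL; p = p->pCtxtBack) {
    snapshot[c++] = p->id;
    snapshot[c++] = p->status;
}
LeaveCriticalSection(&g_CSuserslist);  // lock held only for the memory copy

WSABUF wsabuf;
wsabuf.buf = snapshot;
wsabuf.len = c;
DWORD sent = 0;
WSASend(s, &wsabuf, 1, &sent, 0, NULL, NULL);  // I/O outside the lock; 's' is the new user's socket (assumed)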
Edit per comment :
Your struct user is reasonably small, somewhere between 10 and 18 useful bytes (depending on whether pointers are 4 or 8 bytes), and 12 or 24 bytes in total including padding. With 1000 connected users you only have to copy less than 24 KB of memory, and the loop only has to test whether the next user is null (or at most keep a count of currently connected users for a simpler loop). In any case, maintaining such a buffer should also be done inside a critical section. IMHO, until you have far more than 1000 users (somewhere between 10k and 100k, though you would probably hit other problems first...) a simple global lock (like your critical section) around the whole doubly linked list of users should be enough. But all of that needs to be measured, because it may depend on external factors such as hardware...
Too Long Don't Read discussion :
As you describe your application, you only gather the list of connected users when a new user connects, so you have exactly one full read per two writes (one at connection and one at disconnection): IMHO it is not worth trying to implement shared locks for reading and exclusive ones for writing. If you did many reads between a connection and a disconnection it would be a different story, and you should then try to allow concurrent reads.
If you really find the contention too heavy, because you have a very large number of connected users and very frequent connections/disconnections, you could try to implement something like row-level locking. Instead of locking the whole list, only lock what you are processing: the head and first node for an insertion, the current record plus the previous and next ones for a deletion, and the current and next ones while reading. But it will be hard to write and test and much more time consuming, because you will have to do many lock/release operations while reading the list, and you will have to be very careful to avoid deadlock conditions. So my advice is: don't do that unless it is really required.
(1) in the code you show, the WSASend(...,wsabuf,...); is inside the critical section when it should be outside. Write instead :
...
LeaveCriticalSection(&g_CSuserslist);
WSASend(...,wsabuf,...);
The first performance problem is the linked list itself: it takes quite a bit longer to traverse a linked list than to traverse an array/std::vector<>. A singly linked list has the advantage of allowing thread-safe insertion/deletion of elements via atomic types/compare-and-swap operations. A doubly linked list is much harder to maintain in a thread-safe fashion without resorting to mutexes (which are always the big, heavy guns).
So, if you go with a mutex to lock the list, use std::vector<>, but you can also solve your problem with a lock-free implementation of a singly linked list:
You have a singly linked list with one head that is a global, atomic variable.
All entries are immutable once they are published.
When you add a user, take the current head and store it in a thread-local variable (an atomic read). Since the entries won't change, you have all the time in the world to traverse this list, even if other threads add more users while you are traversing it.
To add the new user, create a new list head containing it, then use a compare-and-swap operation to replace the old list head pointer with the new one. If that fails, retry.
To remove a user, traverse the list until you find the user in the list. While you walk the list, copy its contents to newly allocated nodes in a new linked list. Once you find the user to delete, set the next pointer of the last user on the new list to the deleted user's next pointer. Now the new list contains all users of the old one except the removed user. So you can now publish that list by another compare-and-swap on the list head. Unfortunately, you'll have to redo the work should the publishing operation fail.
Do not set the next pointer of the deleted object to NULL, another thread might still need it to find the rest of the list (in its view the object won't have been removed yet).
Do not delete the old list head right away; another thread might still be using it. The best thing to do is to enqueue its nodes in another list for cleanup. This cleanup list should be replaced from time to time with a new one, and the old one should be cleaned up only after all threads have given their OK (you can implement this by passing around a token: when it comes back to the originating thread, you can safely destroy the old objects).
Since the list head pointer is the only globally visible variable that can ever change, and since that variable is atomic, such an implementation guarantees a total ordering of all add/remove operations.
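As a rough illustration of the publish-by-CAS insertion described above (names are illustrative; removal and the deferred cleanup list are deliberately left out, since those are the hard parts):

#include <atomic>

struct UserNode {
    char id;
    char status;
    const UserNode* next;   // immutable once the node is published
};

std::atomic<const UserNode*> g_head{nullptr};

// Publish a new user at the head of the list.
void add_user(char id, char status) {
    UserNode* node = new UserNode{id, status, nullptr};
    const UserNode* old_head = g_head.load(std::memory_order_acquire);
    do {
        node->next = old_head;   // link to the head we last saw
    } while (!g_head.compare_exchange_weak(old_head, node,
                                           std::memory_order_release,
                                           std::memory_order_acquire));
}

// Readers take a snapshot of the head and walk it; published nodes never change.
void for_each_user(void (*fn)(const UserNode&)) {
    for (const UserNode* p = g_head.load(std::memory_order_acquire); p; p = p->next)
        fn(*p);
}

The compare_exchange_weak loop retries with the refreshed head whenever another thread won the race, which is exactly the "if that fails, retry" step above.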
The "correct" answer is probably to send less data to your users. Do they really NEED to know the id and status of every other user or do they only need to know aggregate information which can be kept up-to-date dynamically.
If your app must send this information (or such changes are considered too much work), then you could cut down your processing significantly by only making this calculation, say, once per second (or even per minute). Then, when someone logged on, they would receive a copy of this information that is, at most, 1 second old.
The real question here is just how urgent it is to send every byte of that list to the new user?
How well does the client side track this list data?
If the client can handle partial updates, wouldn't it make more sense to 'trickle' the data to each user - perhaps using a timestamp to indicate freshness of the data and not have to lock the list in such a massive fashion?
You could also switch to an rwsem-style (reader/writer) lock, where list access is only exclusive when the caller intends to modify the list.
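If C++17 is available, that reader/writer split can be sketched with std::shared_mutex (the vector and helper names below are assumptions, not the question's code):

#include <shared_mutex>
#include <vector>

struct UserInfo { char id; char status; };

std::shared_mutex g_usersMutex;
std::vector<UserInfo> g_users;

// Many readers may hold the shared lock at once.
std::vector<UserInfo> snapshot_users() {
    std::shared_lock<std::shared_mutex> lock(g_usersMutex);
    return g_users;                        // copy while holding the shared lock
}

// Writers take the lock exclusively.
void add_user(UserInfo u) {
    std::unique_lock<std::shared_mutex> lock(g_usersMutex);
    g_users.push_back(u);
}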

How to design a datastructure that spits out one available space for each thread in CUDA

In my CUDA project I need a data structure (available to all threads in the block) that is similar to a "stash". The stash has multiple spaces which can be either empty or full. I need this data structure to hand out an empty space whenever a thread asks for one. A thread will ask for a space in the stash, put something in, and mark that position as full. I could not use a plain FIFO because fetching from the stash is random: any position (and multiple positions) can be marked as empty or full.
My initial version uses an array of flags to record whether each space is empty or not. Each thread loops over the positions (using atomicCAS) until it finds an empty spot. But with this algorithm the search time depends on how full the stash is, which is not acceptable in my design.
How could I design a data structure whose fetch and write-back time does not depend on how full the stash is?
Does this remind anyone of a similar algorithm?
Thanks
You could implement this with a FIFO containing a list of free locations.
At startup you fill the FIFO with all locations.
Then, whenever you want a space, you take the next element from the FIFO.
When you are finished with the slot, you can place the address back into the FIFO again.
This should have O(1) allocation and deallocation time.
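A hedged sketch of that free-location FIFO as a ring of slot indices (host-side C++ with std::atomic; on the device the two counters would be bumped with atomicAdd instead). It assumes the stash is never driven completely empty, so a take never races with the put refilling the same ring cell; a production version would add per-cell sequence numbers, as in a bounded MPMC queue.

#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// A ring of free slot indices. take() and put() are O(1) regardless of how
// full the stash is.
class FreeSlotFifo {
public:
    explicit FreeSlotFifo(std::size_t slots)
        : ring_(slots), capacity_(slots), puts_(slots), takes_(0) {
        for (std::size_t i = 0; i < slots; ++i)
            ring_[i] = i;                       // initially every slot is free
    }

    // Claim one free slot index. Caller must ensure a free slot exists.
    std::size_t take() {
        std::size_t ticket = takes_.fetch_add(1, std::memory_order_relaxed);
        assert(ticket < puts_.load(std::memory_order_acquire) && "stash exhausted");
        return ring_[ticket % capacity_];
    }

    // Return a slot index once it is empty again.
    void put(std::size_t slot) {
        std::size_t ticket = puts_.fetch_add(1, std::memory_order_acq_rel);
        ring_[ticket % capacity_] = slot;       // see caveat above about racing takes
    }

private:
    std::vector<std::size_t> ring_;
    std::size_t capacity_;
    std::atomic<std::size_t> puts_;   // total slots ever made free (starts at capacity)
    std::atomic<std::size_t> takes_;  // total slots ever claimed
};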
You could implement a hash table (SeparateChaining) with ThreadID as the key.
It is more or less an array of linked lists. This way you need not put a lock on the entire array as you did earlier. Instead, you use atomicCAS only while accessing the linked list at a specific index. That way, you can have n threads running in parallel when the array size is n.
Note: The distribution of threads however depends on the hash function.
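A small sketch of that bucket-per-thread idea with a CAS on each bucket head (illustrative names; kBuckets is an assumed size):

#include <atomic>
#include <cstddef>

struct ChainNode {
    int value;
    ChainNode* next;
};

constexpr std::size_t kBuckets = 256;             // assumed bucket count
std::atomic<ChainNode*> g_buckets[kBuckets] = {};

// Threads hashing to different buckets never contend on the same word.
void push(std::size_t thread_id, int value) {
    std::atomic<ChainNode*>& head = g_buckets[thread_id % kBuckets];
    ChainNode* node = new ChainNode{value, head.load(std::memory_order_relaxed)};
    while (!head.compare_exchange_weak(node->next, node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // node->next was refreshed with the current head; retry.
    }
}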

how to design a fixed TIME length circular buffer?

boost::circular_buffer can provide a fixed-length buffer, e.g. one of size 5.
Imagine that I have a realtime data stream coming in with timestamps. I want to keep a buffer of all the elements from the last 5 minutes.
Naively, I can build a wrapper around std::list: every time a new data point D comes in, I push_back(D), and then run a while loop to pop_front() all the data points older than 5 minutes.
The problem with such a design is that I have to construct a new instance for every point, which seems to be a waste of time (this is a very heavily used object).
Does anyone here have a more elegant solution?
Thanks!
A list or a deque are both suitable for ring buffers. If your objects are trivially copyable and small, you can just use the deque and probably not worry about the allocations. If you have larger data, you can use the list plus a custom object pool (so that old, unused objects are reused for future additions).
If you don't like the std collection object pool semantics (crappy prior to C++11, I'm not sure now), then you can simply store pointers in the deque and manage your own memory.
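A hedged sketch of the deque approach with a fixed time window (TimeWindowBuffer is an illustrative name, not a standard or Boost type):

#include <chrono>
#include <cstddef>
#include <deque>
#include <utility>

template <typename T>
class TimeWindowBuffer {
public:
    using clock = std::chrono::steady_clock;

    explicit TimeWindowBuffer(clock::duration window) : window_(window) {}

    void push(T value) {
        const auto now = clock::now();
        buf_.emplace_back(now, std::move(value));
        // Drop everything older than the window; amortized O(1) per push.
        while (!buf_.empty() && now - buf_.front().first > window_)
            buf_.pop_front();
    }

    std::size_t size() const { return buf_.size(); }

private:
    clock::duration window_;
    std::deque<std::pair<clock::time_point, T>> buf_;
};

// Usage: TimeWindowBuffer<double> last5min(std::chrono::minutes(5));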

Is it possible to send array over network?

I'm using C++ and wondering if I can just send an entire int array over a network (using basic sockets) without doing anything special. Or do I have to split the data up and send the elements one at a time?
Yes.
An array will be laid out sequentially in memory, so you are free to do this. Simply pass in the address of the first element and the size of the data in bytes, and you'll send all of it.
You could definitely send an array in one send, however you might want to do some additional work. There are issues with interpreting it correctly at the receiving end. For example, if using different machine architectures, you may want to convert the integers to network order (e.g., htonl).
Another thing to keep in mind is the memory layout. If it is a simple array of integers, then it would be contiguous in memory and a single send could successfully capture all the data. If, though, (and this is probably obvious), you have an array with other data, then the layout definitely needs consideration. A simple example would be if the array had pointers to other data such as a character string, then a send of the array would be sending pointers (and not data) and would be meaningless to the receiver.
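A sketch of the "convert, then one send" idea on POSIX sockets (send_int_array is an illustrative helper; error handling is trimmed):

#include <arpa/inet.h>   // htonl
#include <sys/socket.h>  // send
#include <sys/types.h>   // ssize_t
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert each element to network byte order, then push the whole
// contiguous block through send().
bool send_int_array(int sock, const std::vector<std::int32_t>& values) {
    std::vector<std::uint32_t> wire(values.size());
    for (std::size_t i = 0; i < values.size(); ++i)
        wire[i] = htonl(static_cast<std::uint32_t>(values[i]));

    const char* p = reinterpret_cast<const char*>(wire.data());
    std::size_t left = wire.size() * sizeof(std::uint32_t);
    while (left > 0) {                       // send() may write less than asked
        ssize_t n = send(sock, p, left, 0);
        if (n <= 0) return false;
        p += n;
        left -= static_cast<std::size_t>(n);
    }
    return true;
}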

Why does one loop take longer to detect a shared memory update than another loop?

I've written a 'server' program that writes to shared memory, and a client program that reads from the memory. The server has different 'channels' that it can be writing to, which are just different linked lists that it's appending items too. The client is interested in some of the linked lists, and wants to read every node that's added to those lists as it comes in, with the minimum latency possible.
I have 2 approaches for the client:
For each linked list, the client keeps a 'bookmark' pointer to keep its place within the linked list. It round robins the linked lists, iterating through all of them over and over (it loops forever), moving each bookmark one node forward each time if it can. Whether it can is determined by the value of a 'next' member of the node. If it's non-null, then jumping to the next node is safe (the server switches it from null to non-null atomically). This approach works OK, but if there are a lot of lists to iterate over, and only a few of them are receiving updates, the latency gets bad.
The server gives each list a unique ID. Each time the server appends an item to a list, it also appends the ID number of the list to a master 'update list'. The client only keeps one bookmark, a bookmark into the update list. It endlessly checks if the bookmark's next pointer is non-null ( while(node->next_ == NULL) {} ), if so moves ahead, reads the ID given, and then processes the new node on the linked list that has that ID. This, in theory, should handle large numbers of lists much better, because the client doesn't have to iterate over all of them each time.
When I benchmarked the latency of both approaches (using gettimeofday), to my surprise #2 was terrible. The first approach, for a small number of linked lists, would often be under 20us of latency. The second approach would have brief stretches of low latency but would often sit between 4,000 and 7,000us!
Through inserting gettimeofday's here and there, I've determined that all of the added latency in approach #2 is spent in the loop repeatedly checking if the next pointer is non-null. This is puzzling to me; it's as if the change in one process is taking longer to 'publish' to the second process with the second approach. I assume there's some sort of cache interaction going on I don't understand. What's going on?
Update: Originally, approach #2 used a condition variable, so that if node->next_ == NULL it would wait on the condition, and the server would notify on the condition every time it issued an update. The latency was the same, and in trying to figure out why I reduced the code down to the approach above. I'm running on a multicore machine, so one process spinlocking shouldn't affect the other.
Update 2: node->next_ is volatile.
Since it sounds like reads and writes are occurring on separate CPUs, perhaps a memory barrier would help? Your writes may not be occurring when you expect them to be.
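A sketch of what that looks like if the node layout can be changed to use std::atomic (illustrative names): the producer's release store pairs with the consumer's acquire load, which gives prompt, well-defined publication of next_, something volatile alone does not guarantee.

#include <atomic>

struct Node {
    int payload;
    std::atomic<Node*> next{nullptr};
};

// Producer: fill the node, then publish it with a release store.
void publish(Node* tail, Node* fresh) {
    fresh->payload = 42;                                  // write the data first
    tail->next.store(fresh, std::memory_order_release);   // then make it visible
}

// Consumer: the acquire load pairs with the release store above.
Node* wait_for_next(const Node* tail) {
    Node* n;
    while ((n = tail->next.load(std::memory_order_acquire)) == nullptr) {
        // spinning; a pause/yield here keeps the core from starving the writer
    }
    return n;
}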
You are busy-waiting (spin-locking) in #2, which is generally not such a great idea, and it is chewing up cycles.
Have you tried adding a yield after each failed polling-attempt in your second approach? Just a guess, but it may reduce the power-looping.
With Boost.Thread this would look like this:
while (node->next_ == NULL) {
    boost::this_thread::yield();
}