Locking a queue while re-ordering it in ColdFusion - coldfusion

Please consider the following:
I have a queue of objects represented as an array.
I process items off the top of the array (at position 1), then call arrayDeleteAt() to remove each one from the array.
I add new queue items to the end of the array using arrayAppend().
This works fine. However, I now wish to re-order the array immediately after adding an item.
I am concerned that a thread taking from the queue could find the queue order has changed between it reading the item at position 1 and it deleting the item at position 1, because in that time an additional item has been added and the queue has been re-sorted. So I need to ensure my queue is thread-safe.
Is there any way of doing this using the cflock tag? Since my add and remove code live in different places, the thread executing one bit of code would need to know that another thread is executing a different, specific bit of code, and halt until that other thread has finished executing its code.
Or am I better off just raising a flag while the sorting is going on, and preventing anything from being taken from the array while the sort is in progress?
All this is happening in the APPLICATION scope on a CF 8 Enterprise server.
Thanks in advance for any help.
Ciaran

An exclusive CFLOCK should do what you want. You could just scope-lock APPLICATION, but that might be overly broad. Probably best to do it as a named lock. It won't matter where the different bits of code with the lock are located, as long as they're all using the same name.
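As a minimal sketch (assuming the queue lives at application.queue, and using a hypothetical lock name "queueLock" and a hypothetical sortQueue() function), both code paths would wrap their work in the same named lock:

<!--- where items are added and the queue re-sorted --->
<cflock name="queueLock" type="exclusive" timeout="10">
    <cfset arrayAppend(application.queue, newItem)>
    <cfset application.queue = sortQueue(application.queue)>
</cflock>

<!--- where items are taken and removed --->
<cflock name="queueLock" type="exclusive" timeout="10">
    <cfset item = application.queue[1]>
    <cfset arrayDeleteAt(application.queue, 1)>
</cflock>

Because both blocks use the same lock name, a thread re-sorting the queue and a thread consuming from it can never interleave between the read at position 1 and the delete at position 1.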

Related

Thread Safe Integer Array?

I have a situation where I have a legacy multi-threaded application I'm trying to move to a Linux platform and convert to C++.
I have a fixed size array of integers:
int R[5000];
And I perform a lot of operations like:
R[5] = (R[10] + R[20]) / 50;
R[5]++;
I have one foreground task that mostly reads the values, but on occasion can update one. And then I have a background worker that is updating the values constantly.
I need to make this structure thread safe.
I would rather only update a value if it has actually changed. The worker is constantly collecting data, doing calculations, and storing the data whether it changes or not.
So should I create a custom class MyInt which holds the structure, include an array of mutexes to lock for updating/reading each value, and then overload [], =, ++, +=, -=, etc.? Or should I try to implement an atomic integer array?
Any suggestions as to what that would look like? I'd like to try to keep the above notation for doing the updates... but I get that it might not be possible.
Thanks,
WB
The first thing to do is make the program work reliably, and the easiest way to do that is to have a Mutex that is used to control access to the entire array. That is, whenever either thread needs to read or write to anything in the array, it should do:
the_mutex.lock();
// do all the array-reads, calculations, and array-writes it needs to do
the_mutex.unlock();
... then test your program and see if it still runs fast enough for your needs. If so, you're done; that's all you need to do.
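For illustration, a minimal C++11 sketch of that single-mutex pattern, reusing the R array from the question (the_mutex and the sample indices are just placeholders):

#include <mutex>

int R[5000];
std::mutex the_mutex;

void update_example()
{
    std::lock_guard<std::mutex> guard(the_mutex); // released automatically at scope exit
    R[5] = (R[10] + R[20]) / 50;
    R[5]++;
}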
If you find that the program isn't fast enough due to contention on the mutex, you can start trying optimizations to make things faster. For example, if you know that your threads' operations will only need to work on local segments of the array at one time, you could create multiple mutexes, and assign different subsets of the array to each mutex (e.g. mutex #1 is used to serialize access to the first 100 array items, mutex #2 for the second 100 array items, etc). That will greatly decrease the chances of one thread having to wait for the other thread to release a mutex before it can continue.
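A rough sketch of that lock-striping idea, assuming 50 stripes of 100 items each (the stripe size is an arbitrary choice); note that an operation touching indices in different stripes would have to lock every stripe involved, in a fixed order, to avoid deadlock:

#include <mutex>

int R[5000];
std::mutex stripes[50]; // stripe i guards R[i*100] .. R[i*100 + 99]

std::mutex& mutex_for(int index)
{
    return stripes[index / 100];
}

void increment(int index)
{
    std::lock_guard<std::mutex> guard(mutex_for(index));
    R[index]++;
}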
If things still aren't fast enough for you, you could then look into having two different arrays, one for each thread, and occasionally copying from one array to the other. That way each thread could safely access its own private array without any serialization needed. The copying operation would need to be handled carefully, probably using some sort of inter-thread message-passing protocol.

Sending list of connected users to newly connected user in multithreaded IOCP server

I need some advice on how to properly send a doubly linked list of connected users. Some basic information about my code and my approach so far:
I keep information about all connected users in a doubly linked list, which is shared among threads. I store the head of the list in a global variable PPER_user g_usersList, and the struct for users looks like:
typedef struct _user {
    char id;
    char status;
    struct _user *pCtxtBack;
    struct _user *pCtxtForward;
} user, *PPER_user;
When a new user connects to the server, data about the connected users is gathered from the linked list and sent to him:
WSABUF wsabuf;
PPER_user pTemp1, pTemp2;
unsigned int c = 0;
.....
EnterCriticalSection(&g_CSuserslist);
pTemp1 = g_usersList;
while (pTemp1) {
    pTemp2 = pTemp1->pCtxtBack;
    wsabuf.buf[c++] = pTemp1->id;     // fill buffer with data about all users
    wsabuf.buf[c++] = pTemp1->status;
    pTemp1 = pTemp2;
}
WSASend(...,wsabuf,...);
LeaveCriticalSection(&g_CSuserslist);
But a few things about code above makes me confused:
the linked list is under heavy usage by other threads. The more users are connected (for example 100 or 1000), the longer the list stays locked for the entire duration of gathering the data. Should I reconcile myself to that, or find some better way to do this?
it seems that while one thread holds the lock and the while loop walks through all the chained structs (users) gathering every id and status, other threads that want to change their own id, status, etc. must wait on the same critical section (&g_CSuserslist). But this will likely kill the performance. Maybe I should change the whole design of my app, or something?
Any insight you may have would be appreciated. Thanks in advance.
The only problem I see in your code (and more generally in the description of your app) is the size of the critical section that protects g_usersList. The rule is: avoid any time-consuming operation while inside a critical section.
So you must protect:
adding a new user
removing a user at disconnection
taking a snapshot of the list for further processing
All those operations are memory-only, so unless you run under really heavy conditions, all should be fine provided you put all I/O outside of the critical sections (1), because they only happen when users are connecting or disconnecting. If you move the WSASend outside of the critical section, all should go fine, and IMHO that is enough.
Edit per comment :
Your struct user is reasonably small: I would say between 10 and 18 useful bytes (depending on whether pointers are 4 or 8 bytes), for a total of 12 or 24 bytes including padding. With 1000 connected users you only have to copy less than 24K bytes of memory, and only have to test whether the next user is null (or at most keep a count of currently connected users to get a simpler loop). Anyway, maintaining such a buffer should also be done inside a critical section. IMHO, until you have far more than 1000 users (between 10K and 100K, though you could hit other problems by then...), a simple global lock (like your critical section) around the whole doubly linked list of users should be enough. But all of that needs to be measured, because it may depend on external things like hardware...
Too Long; Don't Read discussion:
As you describe your application, you only gather the list of connected users when a new user connects, so you have exactly one full read per two writes (one at connection and one at disconnection): IMHO there is no point trying to implement shared locks for reading and exclusive ones for writing. If you did many reads between a connection and a disconnection, it would be a different story, and you should then try to allow concurrent reads.
If you really find that the contention is too heavy, because you have a very large number of connected users and very frequent connections/disconnections, you could try to implement row-level-style locking. Instead of locking the whole list, only lock what you are processing: the head and the first record for an insertion; the current record plus its previous and next for a deletion; and the current and next records while reading. But it will be hard to write and test, and much more time-consuming, because you will have to do many lock/release cycles while reading the list, and you will have to be very cautious to avoid deadlock conditions. So my advice is: don't do that unless it is really required.
(1) In the code you show, the WSASend(...,wsabuf,...); is inside the critical section when it should be outside. Write instead:
...
LeaveCriticalSection(&g_CSuserslist);
WSASend(...,wsabuf,...);
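Put together, the whole send path would then look roughly like this (same elisions as in the question's code; wsabuf points at a thread-local buffer, so it is safe to use outside the lock):

EnterCriticalSection(&g_CSuserslist);
pTemp1 = g_usersList;
while (pTemp1) {
    // copy ids and statuses into the local buffer while holding the lock
    wsabuf.buf[c++] = pTemp1->id;
    wsabuf.buf[c++] = pTemp1->status;
    pTemp1 = pTemp1->pCtxtBack;
}
LeaveCriticalSection(&g_CSuserslist);
// the potentially slow network call now happens outside the critical section
WSASend(...,wsabuf,...);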
The first performance problem is the linked list itself: it takes quite a bit longer to traverse a linked list than to traverse an array/std::vector<>. A singly linked list has the advantage of allowing thread-safe insertion/deletion of elements via atomic types/compare-and-swap operations. A doubly linked list is much harder to maintain in a thread-safe fashion without resorting to mutexes (which are always the big, heavy guns).
So, if you go with a mutex to lock the list, use std::vector<>. But you can also solve your problem with a lock-free implementation of a singly linked list:
You have a single linked list with one head that is a global, atomic variable.
All entries are immutable once they are published.
When you add a user, take the current head and store it in a thread-local variable (atomic read). Since the entries won't change, you have all the time in the world to traverse this list, even if other threads add more users while you are traversing it.
To add the new user, create a new list head containing it, then use a compare-and-swap operation to replace the old list head pointer with the new one. If that fails, retry.
To remove a user, traverse the list until you find the user in the list. While you walk the list, copy its contents to newly allocated nodes in a new linked list. Once you find the user to delete, set the next pointer of the last user on the new list to the deleted user's next pointer. Now the new list contains all users of the old one except the removed user. So you can now publish that list by another compare-and-swap on the list head. Unfortunately, you'll have to redo the work should the publishing operation fail.
Do not set the next pointer of the deleted object to NULL, another thread might still need it to find the rest of the list (in its view the object won't have been removed yet).
Do not delete the old list head right away, another thread might still be using it. The best thing to do is to enqueue its nodes in another list for cleanup. This cleanup list should be replaced from time to time with a new one, and the old one should be cleaned up after all threads have given their OK (you can implement this by passing around a token; when it comes back to the originating thread, you can safely destroy the old objects).
Since the list head pointer is the only globally visible variable that can ever change, and since that variable is atomic, such an implementation guarantees a total ordering of all add/remove operations.
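To make the add operation concrete, here is a minimal C++11 sketch of the compare-and-swap insert described above (the node layout and names are made up for illustration):

#include <atomic>

struct Node {
    char id;
    char status;
    Node* next; // immutable once the node is published
};

std::atomic<Node*> g_head{nullptr};

void add_user(Node* n)
{
    Node* old_head = g_head.load();
    do {
        n->next = old_head; // safe: n is not yet visible to other threads
    } while (!g_head.compare_exchange_weak(old_head, n)); // retry on contention
}

compare_exchange_weak refreshes old_head when it fails, so the loop simply re-links the new node and retries, exactly as described above.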
The "correct" answer is probably to send less data to your users. Do they really NEED to know the id and status of every other user or do they only need to know aggregate information which can be kept up-to-date dynamically.
If your app must send this information (or such changes are considered too much work), then you could cut down your processing significantly by only making this calculation, say, once per second (or even per minute). Then, when someone logged on, they would receive a copy of this information that is, at most, 1 second old.
The real question here is just how urgent it is to send every byte of that list to the new user?
How well does the client side track this list data?
If the client can handle partial updates, wouldn't it make more sense to 'trickle' the data to each user - perhaps using a timestamp to indicate freshness of the data and not have to lock the list in such a massive fashion?
You could also switch to an rwsem-style (reader/writer) lock, where list access is only exclusive if the user intends to modify the list.
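In modern C++ terms that would look something like the sketch below (std::shared_mutex is C++17; a platform rwlock or Boost's shared_mutex would play the same role on older toolchains):

#include <shared_mutex>

std::shared_mutex list_lock;

void read_users()
{
    std::shared_lock<std::shared_mutex> lock(list_lock); // many readers may hold this at once
    // walk the list and copy out ids/statuses...
}

void modify_users()
{
    std::unique_lock<std::shared_mutex> lock(list_lock); // writers get exclusive access
    // insert or remove a user...
}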

How to design a data structure that spits out one available space for each thread in CUDA

In my project with CUDA I need to have a data structure (available to all threads in the block) that is similar to a "stash". In this stash there are multiple spaces which can be either empty or full. I need this data structure to spit out an empty space whenever a thread asks for one. A thread will ask for a space in the stash, put something in, and mark that position as full. I could not use a FIFO because fetching from the stash is random: any position (and multiple positions) can be marked as empty or full.
The initial version I have uses an array to represent whether each space is empty or not. Each thread loops through the positions (using atomicCAS) until it finds an empty spot. But with this algorithm the search time depends on how full the stash is, which is not acceptable in my design.
How could I design a data structure whose fetch and write-back times do not depend on how full the stash is? Does this remind anyone of any similar algorithm?
Thanks
You could implement this with a FIFO containing a list of free locations.
At startup you fill the FIFO with all locations.
Then whenever you want a space, you take the next element from the FIFO.
When you are finished with the slot, you can place the address back into the FIFO again.
This should have O(1) allocation and deallocation time.
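A rough CUDA sketch of that free-list FIFO, assuming it is a ring buffer of slot indices with monotonically increasing head/tail counters; it does no underflow checking, so it assumes threads never try to acquire more slots than have been released (all names here are illustrative):

// At startup: free_list[i] = i for i in [0, capacity), *head = 0, *tail = capacity.
__device__ int acquire_slot(int* free_list, unsigned int* head,
                            unsigned int capacity)
{
    unsigned int ticket = atomicAdd(head, 1u); // claim a unique FIFO entry
    return free_list[ticket % capacity];       // O(1), regardless of fullness
}

__device__ void release_slot(int* free_list, unsigned int* tail,
                             unsigned int capacity, int slot)
{
    unsigned int ticket = atomicAdd(tail, 1u); // claim a unique write position
    free_list[ticket % capacity] = slot;
}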
You could implement a hash table (separate chaining) with the thread ID as the key.
It is more or less an array of linked lists. This way you need not put a lock on the entire array as you did earlier; instead, you use atomicCAS only while reading the linked list at a specific index. Thereby, you can have n threads running in parallel where the array size is n.
Note: The distribution of threads however depends on the hash function.

Efficiently iterating through a map while inserting on other thread

I have a std::map<std::string, std::string> which is having values added to it at irregular intervals from one thread (but frequently, and the insertions need to be very fast), and occasionally having groups of entries removed.
From a different thread, I need to dump a snapshot of the map as text to a debug log, on command from a user.
Clearly it's not thread-safe to just iterate through the map outputting the debug information while it could be updated, so I'm currently taking a read lock (mutex) before dumping the data and a write lock for every insert or delete. This works fine, but I can't really lock the map for this long; it delays the processing of incoming updates too much.
I don't believe I can lock and unlock in the debug-dump thread for each item, as modifying the map from the other thread can invalidate the iterator.
Is there any way I can do this safely, without having to take out a read lock on the whole data structure while I write it out, so that new values can still be inserted quickly? I realise I won't be able to get a guaranteed consistent view of the data if values can be added and removed while I'm iterating through it, but as long as it's safe, that's understood.
If there is no way to use a map for this, can anyone suggest any other data structure I could use?
edit: I'm hoping for a solution that means I don't need to take out an expensive lock when adding an item.
There are two solutions I can see at the moment:
(Easy, but might still take too long): copy the map (or assign it to another container) while locked, and then dump the local copy to the debug log while not locked.
(Some more work): delegate the updates of the map to another thread via a queue. If that thread is also the one that dumps to the debug log, then you don't need the locks any more. This way the fast thread is only locked while accessing the queue.
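A minimal sketch of the first option (the map, mutex, and logging call are placeholder names):

#include <map>
#include <mutex>
#include <string>

std::map<std::string, std::string> live_map;
std::mutex map_mutex;

void write_debug_line(const std::string& k, const std::string& v); // hypothetical logging hook

void dump_to_log()
{
    std::map<std::string, std::string> snapshot;
    {
        std::lock_guard<std::mutex> lock(map_mutex);
        snapshot = live_map; // only the copy happens under the lock
    }
    for (const auto& kv : snapshot) // iterate the private copy, lock already released
        write_debug_line(kv.first, kv.second);
}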

Why does one loop take longer to detect a shared memory update than another loop?

I've written a 'server' program that writes to shared memory, and a client program that reads from the memory. The server has different 'channels' that it can write to, which are just different linked lists that it's appending items to. The client is interested in some of the linked lists, and wants to read every node that's added to those lists as it comes in, with the minimum latency possible.
I have 2 approaches for the client:
For each linked list, the client keeps a 'bookmark' pointer to keep its place within the linked list. It round robins the linked lists, iterating through all of them over and over (it loops forever), moving each bookmark one node forward each time if it can. Whether it can is determined by the value of a 'next' member of the node. If it's non-null, then jumping to the next node is safe (the server switches it from null to non-null atomically). This approach works OK, but if there are a lot of lists to iterate over, and only a few of them are receiving updates, the latency gets bad.
The server gives each list a unique ID. Each time the server appends an item to a list, it also appends the ID number of the list to a master 'update list'. The client keeps only one bookmark, a bookmark into the update list. It endlessly checks whether the bookmark's next pointer is non-null ( while(node->next_ == NULL) {} ); if it is, it moves ahead, reads the ID given, and then processes the new node on the linked list that has that ID. This, in theory, should handle large numbers of lists much better, because the client doesn't have to iterate over all of them each time.
When I benchmarked the latency of both approaches (using gettimeofday), to my surprise #2 was terrible. The first approach, for a small number of linked lists, would often be under 20us of latency. The second approach would have brief stretches of low latency but would often be between 4,000 and 7,000us!
Through inserting gettimeofday calls here and there, I've determined that all of the added latency in approach #2 is spent in the loop repeatedly checking whether the next pointer is non-null. This is puzzling to me; it's as if the change in one process is taking longer to 'publish' to the second process with the second approach. I assume there's some sort of cache interaction going on that I don't understand. What's going on?
Update: Originally, approach #2 used a condition variable, so that if node->next_ == NULL it would wait on the condition, and the server would notify on the condition every time it issued an update. The latency was the same, and in trying to figure out why, I reduced the code down to the approach above. I'm running on a multicore machine, so one process spinning shouldn't affect the other.
Update 2: node->next_ is volatile.
Since it sounds like reads and writes are occurring on separate CPUs, perhaps a memory barrier would help? Your writes may not be occurring when you expect them to be.
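For example, if the code can use C++11 atomics (or equivalent compiler intrinsics), publishing each node's next pointer with release/acquire ordering makes the handoff explicit; a sketch with a made-up node layout:

#include <atomic>

struct Node {
    int payload;
    std::atomic<Node*> next_{nullptr};
};

// Writer (server): fill in the new node completely, then publish it.
void publish(Node* tail, Node* fresh)
{
    tail->next_.store(fresh, std::memory_order_release);
}

// Reader (client): the acquire load pairs with the release store above,
// so once a non-null pointer is seen, the node's contents are visible too.
Node* poll(Node* bookmark)
{
    return bookmark->next_.load(std::memory_order_acquire);
}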
You are effectively spin-waiting in #2, which is generally not such a great idea, and it is chewing up cycles.
Have you tried adding a yield after each failed polling attempt in your second approach? Just a guess, but it may reduce the power-looping.
With Boost.Thread this would look like this:
while (node->next_ == NULL) {
    boost::this_thread::yield();
}