Say there is a book library. People can borrow books and return books. There are one or more copies of each book.
Let's assume that:
1. If a person comes to the library with a list of books, he won't leave the library without all of the books on it.
2. No borrower's list contains a book of which the library doesn't own at least a single copy.
We agree that the borrowers are represented by threads.
I can think of only one option to implement it:
public synchronized void borrowBooks(final ArrayList<Book> booksList)
{
    boolean interrupted = false;
    /* Wait until every requested book has at least one available copy */
    while (!areBooksPresent(booksList)) {
        try {
            this.wait();
        } catch (InterruptedException e) {
            interrupted = true; // remember it, but keep waiting so the "all or nothing" invariant holds
        }
    }
    for (Book book : booksList) {
        Book libraryBook = findBook(book);
        /* Decrement the book's amount in the library */
        libraryBook.decAmount();
    }
    if (interrupted) {
        Thread.currentThread().interrupt(); // restore the interrupt flag for the caller
    }
}
public synchronized void returnBooks(final ArrayList<Book> booksList)
{
    for (Book book : booksList) {
        Book libraryBook = findBook(book);
        /* Increment the book's amount in the library */
        libraryBook.incAmount();
    }
    /* Notify all waiting threads that more copies are available */
    this.notifyAll();
}
As you can see, as soon as a thread starts to borrow its books, the whole library is locked and no other thread can borrow at the same time. We also guarantee that in the for loop that borrows the books, it's impossible for a book to be missing.
The main problem is that we lose the whole point of a multi-threaded program when we lock the whole library every time a thread borrows books.
The other alternatives I can think of seem to cause deadlocks.
Do you have any suggestion for a more concurrent solution that is also safe and preserves liveness?
Keep a structure for each book allowing a thread to block until a particular book is returned.
Have each borrower wait for at least one of the books it needs. When it wakes up, have it check to see if all the books it needs are available. If so, it should take them all and leave. If not, it should pick one of the books it needs that are not available and wait for it.
This may, however, cause some borrowers to wait forever. To avoid this, use a special priority algorithm. Keep a queue of the borrowers in arrival order. Put new borrowers at the back of the line. Then simply implement one rule: if there is any book the head borrower needs, only the head borrower may take it.
This will leave one special case you have to handle: A borrower is not the head borrower, all the books it needs are available, but it cannot take them because one of the books is needed by the head borrower. In this case, have a special "head borrower has left" event that the borrower can wait for. When a borrower gets all its books, check if it was the head borrower, and if so unblock all threads waiting on that event.
Any given borrower, at any time, has some finite number of borrowers ahead of it. That number cannot increase, and it will decrease, because one of the borrowers ahead of it (the head one) has absolute priority and every book is returned eventually. No borrower ever holds a book while it waits for other books. So this should be deadlock free.
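For what it's worth, here is a minimal Java sketch of the scheme described above. It collapses the per-book wait structures into a single monitor for brevity, so notifyAll doubles as both the "book returned" and "head borrower has left" events; it is meant to demonstrate the FIFO-priority rule that gives the liveness guarantee, not to be a drop-in replacement for the asker's classes. Books are identified here by plain strings, and all names are just for illustration.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FifoLibrary {

    private final Map<String, Integer> copies = new HashMap<>();
    private final Deque<List<String>> queue = new ArrayDeque<>(); // borrowers in arrival order

    FifoLibrary(Map<String, Integer> initialCopies) {
        copies.putAll(initialCopies);
    }

    public synchronized void borrowBooks(List<String> wanted) throws InterruptedException {
        queue.addLast(wanted);                      // take a place in line
        try {
            while (!mayProceed(wanted)) {
                wait();                             // woken by returns and by departing borrowers
            }
            for (String title : wanted) {
                copies.merge(title, -1, Integer::sum);
            }
        } finally {
            queue.remove(wanted);                   // leaving advances the line ("head has left")
            notifyAll();
        }
    }

    public synchronized void returnBooks(List<String> returned) {
        for (String title : returned) {
            copies.merge(title, 1, Integer::sum);
        }
        notifyAll();                                // copies came back, let waiters re-check
    }

    // The head borrower only needs all its books to be available; anyone else
    // must additionally avoid taking a book that the head borrower needs.
    private boolean mayProceed(List<String> wanted) {
        if (!allAvailable(wanted)) {
            return false;
        }
        List<String> head = queue.peekFirst();
        if (head == wanted) {
            return true;
        }
        for (String title : wanted) {
            if (head.contains(title)) {
                return false;                       // leave that copy for the head borrower
            }
        }
        return true;
    }

    private boolean allAvailable(List<String> wanted) {
        // naive check: assumes a single request does not list the same title twice
        for (String title : wanted) {
            if (copies.getOrDefault(title, 0) < 1) {
                return false;
            }
        }
        return true;
    }
}

Borrowers still contend briefly on the one monitor, but only while checking and updating counters, never while holding books and waiting for others, which is what keeps it deadlock free.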
I am learning about counting semaphores but fail to understand them. I've read almost all the articles I could find but still have a hard time understanding the concept.
Here is my understanding with the toilet key example (http://niclasw.mbnet.fi/MutexSemaphore.html):
Here,
Toilet - Critical Section
Key - Semaphore
Person - Process
Correct me if I am wrong in the above. My questions are:
1) "A semaphore restricts the number of simultaneous users of a shared resource up to a maximum number." Doesn't simultaneous access to a shared variable lead to a race condition?
2) After a semaphore is acquired by a process, will it be running in its critical section? (Based on the example)
3) If a process is in its CS and another process acquires the semaphore, will it run its CS concurrently, or will it be waiting?
Excuse me if my questions are rudimentary, but I am trying hard to understand this. Please explain semaphores with an EXAMPLE.
Let's say there's a VIP room. There can be at most 3 people in the room, so set your semaphore to 3.
The semaphore is your bodyguard, preventing a 4th person from entering the room.
How does it work? You can call two functions on a semaphore: wait() and signal().
wait() simply decreases the semaphore value by 1, and if the new value is negative, the caller has to wait until someone signals.
So you start with semaphore = 3. With every person entering the room, it's decreased by 1.
person 1 enters...sem = 2
person 2 enters...sem = 1
person 3 enters...sem = 0
Person 4 tries to enter, but sem = -1 now, so our bodyguard will not let him in. He has to wait... until someone leaves the room!
When you're done with the critical section, you call signal() to let the outsider know that he can enter now. As you can tell, this function increases the semaphore value by 1.
A mutex is a semaphore initialized to 1.
thread A:
    mutex.wait();   // makes mutex = 0
    CS();
    mutex.signal();
thread B:
    mutex.wait();   // has to wait until A signals so that mutex = 1 again
    CS();
    mutex.signal();
With this implementation, while A is running there's no way that B can reach the CS, where it changes some global variable, after a context switch. B has to wait until A is done, and vice versa.
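To make that concrete, here is a runnable Java sketch of the VIP-room example using java.util.concurrent.Semaphore, where acquire() plays the role of wait() and release() plays the role of signal(); the class name and the sleep are just for illustration.

import java.util.concurrent.Semaphore;

public class VipRoom {

    // 3 permits: at most 3 people inside at once (the "bodyguard")
    private static final Semaphore permits = new Semaphore(3);

    public static void main(String[] args) {
        for (int i = 1; i <= 5; i++) {
            final int person = i;
            new Thread(() -> {
                try {
                    permits.acquire();             // wait(): blocks while no permit is left
                    try {
                        System.out.println("person " + person + " entered");
                        Thread.sleep(1000);        // doing something inside the room (critical section)
                        System.out.println("person " + person + " left");
                    } finally {
                        permits.release();         // signal(): lets the next person in
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}

Persons 4 and 5 block inside acquire() until one of the first three calls release(), exactly like the bodyguard scenario above.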
To get a better understanding of semaphores, you can check famous problems like producer/consumer, the barbershop, and the dining philosophers; there are lots of them. I believe it's more fun that way, and the scenarios help you understand more easily.
Hope this helps!
I need some advice on how to properly send a doubly linked list of connected users. Some basic information about my code and my approach so far:
I keep information about all connected users in a doubly linked list, which is shared among threads. I store the head of the list in a global variable, PPER_user g_usersList, and the struct for users looks like:
typedef struct _user {
    char id;
    char status;
    struct _user *pCtxtBack;
    struct _user *pCtxtForward;
} user, *PPER_user;
When a new user connects to the server, data about the connected users is gathered from the linked list and sent to him:
WSABUF wsabuf; PPER_user pTemp1, pTemp2; unsigned int c=0;
.....
EnterCriticalSection(&g_CSuserslist);
pTemp1 = g_usersList;
while( pTemp1 ) {
    pTemp2 = pTemp1->pCtxtBack;
    wsabuf.buf[c++] = pTemp1->id;     // fill buffer with data about all users
    wsabuf.buf[c++] = pTemp1->status;
    pTemp1 = pTemp2;
}
WSASend(...,wsabuf,...);
LeaveCriticalSection(&g_CSuserslist);
But a few things about the code above confuse me:
The linked list is under rather heavy use by other threads. The more connected users there are (for example 100 or 1000), the longer the list stays locked for the entire duration of gathering the data. Should I accept that or find some better way to do this?
It seems that while one thread locks the list and the while loop walks through all the chained structs (users) gathering every id and status, other threads that want to change their own id, status, etc. have to use the same CriticalSection (&g_CSuserslist). But this will likely kill the performance. Maybe I should change the whole design of my app or something?
Any insight you may have would be appreciated. Thanks in advance.
The only problem I see in your code (and more generally in the description of your app) is the size of the critical section that protects g_usersList. The rule is: avoid any time-consuming operation while in a critical section.
So you must protect:
adding a new user
removing a user at disconnection
taking a snapshot of the list for further processing
All those operations are memory-only, so unless you run under really heavy conditions, all should be fine provided you put all I/O outside of critical sections (1), because it only happens when users are connecting/disconnecting. If you move the WSASend outside of the critical section, all should go fine and IMHO it is enough.
Edit, per comment:
Your struct user is reasonably small, I would say between 10 and 18 useful bytes (depending on pointer size, 4 or 8 bytes), and a total of 12 or 24 bytes including padding. With 1000 connected users you only have to copy less than 24 KB of memory, and you only need to test whether the next user is null (or at most keep the current number of connected users to have a simpler loop). Anyway, maintaining such a buffer should also be done in a critical section. IMHO, until you have far more than 1000 users (between 10k and 100k, but then you could hit other problems...), a simple global lock (like your critical section) around the whole doubly linked list of users should be enough. But all of that needs to be measured, because it may depend on external things like hardware...
Longer discussion:
As you describe your application, you only gather the list of connected users when a new user connects, so you have exactly one full read per two writes (one at connection and one at disconnection): IMHO it is not worth trying to implement shared locks for reading and exclusive ones for writing. If you did many reads between a connection and a disconnection, it would be a different story, and you should then try to allow concurrent reads.
If you really find that the contention is too heavy, because you have a very large number of connected users and very frequent connections/disconnections, you could try to implement row-level-like locking. Instead of locking the whole list, only lock what you are processing: the head and first node for an insertion, the current record plus its previous and next for a deletion, and the current and next while reading. But it will be hard to write and test, and much more time consuming, because you will have to do many lock/release operations while reading the list, and you will have to be very cautious to avoid deadlock conditions. So my advice is: don't do that unless it is really required.
(1) In the code you show, the WSASend(...,wsabuf,...); is inside the critical section when it should be outside. Write instead:
...
LeaveCriticalSection(&g_CSuserslist);
WSASend(...,wsabuf,...);
The first performance problem is the linked list itself: it takes quite a bit longer to traverse a linked list than to traverse an array/std::vector<>. A singly linked list has the advantage of allowing thread safe insertion/deletion of elements via atomic types/compare-and-swap operations. A doubly linked list is much harder to maintain in a thread safe fashion without resorting to mutexes (which are always the big, heavy guns).
So, if you go with a mutex to lock the list, use std::vector<>, but you can also solve your problem with a lock-free implementation of a singly linked list:
You have a singly linked list with one head that is a global, atomic variable.
All entries are immutable once they are published.
When a user connects, take the current head and store it in a thread-local variable (an atomic read). Since the entries won't change, you have all the time in the world to traverse this list, even if other threads add more users while you are traversing it.
To add the new user, create a new list head containing it, then use a compare-and-swap operation to replace the old list head pointer with the new one. If that fails, retry.
To remove a user, traverse the list until you find the user in it. While you walk the list, copy its contents to newly allocated nodes in a new linked list. Once you find the user to delete, set the next pointer of the last user on the new list to the deleted user's next pointer. Now the new list contains all users of the old one except the removed user, so you can publish it with another compare-and-swap on the list head. Unfortunately, you'll have to redo the work if the publishing operation fails.
Do not set the next pointer of the deleted object to NULL; another thread might still need it to find the rest of the list (from its point of view the object won't have been removed yet).
Do not delete the old list head right away; another thread might still be using it. The best thing to do is to enqueue its nodes in another list for cleanup. This cleanup list should be replaced from time to time with a new one, and the old one should be cleaned up after all threads have given their OK (you can implement this by passing around a token; when it comes back to the originating thread, you can safely destroy the old objects).
Since the list head pointer is the only globally visible variable that can ever change, and since that variable is atomic, such an implementation guarantees a total ordering of all add/remove operations.
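The original code is C, but the idea is compact enough to sketch in Java (keeping all new examples here in one language). This shows only the CAS-based insertion and the lock-free snapshot traversal described above; removal, which copies the prefix onto a new list and swaps the head, is omitted. All names are illustrative.

import java.util.concurrent.atomic.AtomicReference;

final class LockFreeUserList {

    static final class Node {
        final char id;
        final char status;
        final Node next;                 // immutable once published
        Node(char id, char status, Node next) {
            this.id = id;
            this.status = status;
            this.next = next;
        }
    }

    private final AtomicReference<Node> head = new AtomicReference<>(null);

    // Prepend a new user with a compare-and-swap; retry if another thread won the race.
    void addUser(char id, char status) {
        Node oldHead, newHead;
        do {
            oldHead = head.get();
            newHead = new Node(id, status, oldHead);
        } while (!head.compareAndSet(oldHead, newHead));
    }

    // Snapshot the head once; the nodes behind it never change, so no lock is needed.
    int fillBuffer(byte[] buf) {
        int c = 0;
        for (Node n = head.get(); n != null; n = n.next) {
            buf[c++] = (byte) n.id;
            buf[c++] = (byte) n.status;
        }
        return c;
    }
}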
The "correct" answer is probably to send less data to your users. Do they really NEED to know the id and status of every other user or do they only need to know aggregate information which can be kept up-to-date dynamically.
If your app must send this information (or such changes are considered too much work), then you could cut down your processing significantly by only making this calculation, say, once per second (or even per minute). Then, when someone logged on, they would receive a copy of this information that is, at most, 1 second old.
The real question here is just how urgent it is to send every byte of that list to the new user?
How well does the client side track this list data?
If the client can handle partial updates, wouldn't it make more sense to 'trickle' the data to each user, perhaps using a timestamp to indicate the freshness of the data, and not have to lock the list in such a massive fashion?
You could also switch to an rwsem-style (reader/writer) lock where list access is only exclusive if the user intends to modify the list.
Some part of the shared memory modified in a critical section consists of a considerable amount of data, of which only a small portion is changed in a single pass (e.g. a free memory pages bitmap).
How do I make sure that when the program is interrupted/killed, the data remains in a consistent state? Any suggestions other than keeping two copies
(like the copy-and-swap in the example below, or having some kind of rollback segment)?
struct some_data{
    int a;
    int t[100000]; // large amount of data in total, but only a few bytes changed in a single pass (e.g. free entries bitmap/tree)
};

short int active=0;
some_data section_data[2];
//---------------------------------------------------
//semaphore down
int inactive = (active + 1) % 2;  // pick the copy that is not currently active
section_data[inactive] = section_data[active];
// now, make changes to the section data (section_data[inactive])
active = inactive;
//semaphore up
You are looking for transactional consistency: a transaction occurs in whole, or not at all.
A common pattern is a journal, where you store the changes you intend to make before you apply them. Anyone accessing the shared memory and detecting the crashed process (such as noticing that they somehow acquired the semaphore with a partially present journal) takes responsibility for replaying the journal before continuing.
You still have one race: the actual writing of a bit signalling to all processes that there is, in fact, a journal to consume. However, that is a small enough body of information that you can send it through whatever channel you please, such as another semaphore or clever use of fences.
It's best if the journal is sufficiently independent of the state of the memory that the repairing process can just start at the beginning of the journal and replay the whole thing. If you have to identify which entry in the journal is "next", then you need a whole lot more synchronization.
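Here is a conceptual sketch of that journal idea, written in Java for consistency with the other examples in this collection, with plain fields standing in for the shared-memory segment. A real implementation would place both the data and the journal record in the same shared mapping and would have to order (and flush) the writes carefully; all names here are illustrative.

final class JournaledState {

    private final int[] data = new int[100000];   // the large, mostly-unchanged state

    // journal record: one pending change at a time
    private volatile boolean journalValid = false;
    private int journalIndex;
    private int journalNewValue;

    // Called by whoever detects a crashed writer (e.g. it acquired the semaphore
    // and found journalValid still set) before touching the data.
    void recover() {
        if (journalValid) {
            data[journalIndex] = journalNewValue; // replaying is idempotent
            journalValid = false;
        }
    }

    // Assumed to be called with the semaphore held.
    void update(int index, int newValue) {
        // 1. record the intent; this must reach shared memory before the data write
        journalIndex = index;
        journalNewValue = newValue;
        journalValid = true;
        // 2. apply the change; if the process dies here, recover() redoes it from the journal
        data[index] = newValue;
        // 3. retire the journal entry
        journalValid = false;
    }
}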
I'm fairly new to the C++ standard library and have been using standard library lists for a specific multithreaded implementation. I noticed that there might be a trick to using lists that I have not seen in any tutorial/blog/forum post, and though it seems obvious to me, it does not seem to be considered by anyone. So maybe I'm too new and possibly missing something, so hopefully someone smarter than me can validate what I am trying to achieve or explain what I am doing wrong.
So we know that, in general, standard library containers are not thread safe, but this seems like a guiding statement more than a rule. With lists there seems to be a level of tolerance for thread safety. Let me explain: we know that list iterators do not get invalidated when we add to or delete from the list; the only iterator that gets invalidated is the one to the deleted item, which you can fix with the following line of code:
it = myList.erase(it)
So now let's say we have two threads; call them thread 1 and thread 2.
Thread 1's responsibility is to add to the list. It treats it as a queue, so it uses the std::list::push_back() function call.
Thread 2's responsibility is to process the data stored in the list as a queue and then after processing it will remove elements from the list.
It is guaranteed that Thread 2 will not remove elements that were just added during its processing, and Thread 1 guarantees that it will queue up the necessary data well ahead of Thread 2's processing. However, keep in mind that elements can be added during Thread 2's processing.
So it seems that this is a reasonable use of lists in this multithreaded environment without the use of locks for data protection. The reason I say it's reasonable is that, essentially, Thread 2 will only process data up to the point at which it retrieved the current end iterator, as shown by the following pseudocode:
Thread 2 {
    iter = myList.begin();

    lock();
    iterEnd = myList.end(); // lock data temporarily in order to get the current
                            // last element in the list
    unlock();

    // perform necessary processing
    while (iter != iterEnd) {
        // process data
        // ...
        // remove element
        iter = myList.erase(iter);
    }
}
Thread 2 uses a lock for a very short amount of time just to know where to stop processing, but for the most part Thread 1 and Thread 2 don't require any other locking. In addition, Thread 2 could possibly avoid locking altogether if it is flexible about exactly which element is treated as the current last one.
Does anyone see anything wrong with my suggestion?
Thanks!
Your program is racy. As an example of one obvious data race: std::list is more than just a collection of doubly-linked nodes. It also has, for example, a data member that stores the number of nodes in the list (it need not be a single data member, but it has to store the count somewhere).
Both of your threads will modify this data member concurrently. Because there is no synchronization of those modifications, your program is racy.
Instances of the Standard Library containers cannot be mutated from multiple threads concurrently without external synchronization.
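For contrast, and staying with Java for the new examples in this collection, here is a sketch of the same single-producer/single-consumer handoff done with a structure that is actually safe to mutate from two threads without external locking; the class name and counts are just for illustration.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SafeHandoff {

    private static final Queue<Integer> queue = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                queue.offer(i);                  // thread-safe insertion
            }
        });
        Thread consumer = new Thread(() -> {
            int seen = 0;
            while (seen < 1000) {
                Integer item = queue.poll();     // thread-safe removal, null if empty
                if (item != null) {
                    seen++;                      // "process" the element here
                }
            }
        });
        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        System.out.println("all elements handed off safely");
    }
}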
I am trying to understand the disruptor pattern. I have watched the InfoQ video and tried to read their paper. I understand there is a ring buffer involved, that it is initialized as an extremely large array to take advantage of cache locality, eliminate allocation of new memory.
It sounds like there are one or more atomic integers which keep track of positions. Each 'event' seems to get a unique id, and its position in the ring is found by taking its modulus with respect to the size of the ring, etc., etc.
Unfortunately, I don't have an intuitive sense of how it works. I have done many trading applications and studied the actor model, looked at SEDA, etc.
In their presentation they mentioned that this pattern is basically how routers work; however I haven't found any good descriptions of how routers work either.
Are there some good pointers to a better explanation?
The Google Code project does reference a technical paper on the implementation of the ring buffer; however, it is a bit dry, academic, and tough going for someone wanting to learn how it works. However, there are some blog posts that have started to explain the internals in a more readable way: there is an explanation of the ring buffer that is the core of the disruptor pattern, a description of the consumer barriers (the part related to reading from the disruptor), and some information on handling multiple producers available.
The simplest description of the Disruptor is: It is a way of sending messages between threads in the most efficient manner possible. It can be used as an alternative to a queue, but it also shares a number of features with SEDA and Actors.
Compared to Queues:
The Disruptor provides the ability to pass a message to another thread, waking it up if required (similar to a BlockingQueue). However, there are 3 distinct differences.
1. The user of the Disruptor defines how messages are stored by extending the Entry class and providing a factory to do the preallocation. This allows for either memory reuse (copying), or the Entry could contain a reference to another object.
2. Putting messages into the Disruptor is a 2-phase process: first a slot is claimed in the ring buffer, which provides the user with the Entry that can be filled with the appropriate data; then the entry must be committed. This 2-phase approach is necessary to allow for the flexible use of memory mentioned above. It is the commit that makes the message visible to the consumer threads (see the sketch after this list).
3. It is the responsibility of the consumer to keep track of the messages that have been consumed from the ring buffer. Moving this responsibility away from the ring buffer itself helped reduce the amount of write contention, as each thread maintains its own counter.
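As a rough illustration of that two-phase claim/commit flow, here is what it looks like with the current (3.x) Disruptor API; the older API this answer describes, built around extending Entry, uses different names, but the claim-then-publish shape is the same.

import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class ClaimPublishSketch {

    // the pre-allocated, mutable entry type
    static class LongEvent {
        long value;
    }

    public static void main(String[] args) throws InterruptedException {
        Disruptor<LongEvent> disruptor =
                new Disruptor<>(LongEvent::new, 1024, DaemonThreadFactory.INSTANCE);
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("consumed " + event.value));
        RingBuffer<LongEvent> ringBuffer = disruptor.start();

        long seq = ringBuffer.next();            // phase 1: claim a slot
        try {
            ringBuffer.get(seq).value = 42L;     // fill the pre-allocated entry in place
        } finally {
            ringBuffer.publish(seq);             // phase 2: commit, making it visible to consumers
        }

        Thread.sleep(100);                       // give the daemon consumer thread time to print
    }
}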
Compared to Actors
The Actor model is closer to the Disruptor than most other programming models, especially if you use the BatchConsumer/BatchHandler classes that are provided. These classes hide all of the complexities of maintaining the consumed sequence numbers and provide a set of simple callbacks when important events occur. However, there are a couple of subtle differences.
The Disruptor uses a 1 thread - 1 consumer model, whereas Actors use an N:M model, i.e. you can have as many actors as you like and they will be distributed across a fixed number of threads (generally 1 per core).
The BatchHandler interface provides an additional (and very important) callback, onEndOfBatch(). This allows slow consumers, e.g. those doing I/O, to batch events together to improve throughput. It is possible to do batching in other Actor frameworks; however, as nearly all other frameworks don't provide a callback at the end of a batch, you need to use a timeout to determine the end of the batch, resulting in poor latency.
Compared to SEDA
LMAX built the Disruptor pattern to replace a SEDA based approach.
The main improvement that it provided over SEDA was the ability to do work in parallel. To do this the Disruptor supports multi-casting the same messages (in the same order) to multiple consumers. This avoids the need for fork stages in the pipeline.
We also allow consumers to wait on the results of other consumers without having to put another queuing stage between them. A consumer can simply watch the sequence number of a consumer that it is dependent on. This avoids the need for join stages in the pipeline.
Compared to Memory Barriers
Another way to think about it is as a structured, ordered memory barrier, where the producer barrier forms the write barrier and the consumer barrier is the read barrier.
First we'd like to understand the programming model it offers.
There are one or more writers. There are one or more readers. There is a line of entries, totally ordered from old to new (pictured as left to right). Writers can add new entries on the right end. Every reader reads entries sequentially from left to right. Readers can't read past writers, obviously.
There is no concept of entry deletion. I use "reader" instead of "consumer" to avoid the image of entries being consumed. However we understand that entries on the left of the last reader become useless.
Generally readers can read concurrently and independently. However, we can declare dependencies among readers. Reader dependencies can form an arbitrary acyclic graph. If reader B depends on reader A, reader B can't read past reader A.
Reader dependencies arise because reader A can annotate an entry, and reader B depends on that annotation. For example, A does some calculation on an entry and stores the result in field a in the entry. A then moves on, and now B can read the entry and the value of a that A stored. If reader C does not depend on A, C should not attempt to read a.
This is indeed an interesting programming model. Regardless of the performance, the model alone can benefit lots of applications.
Of course, LMAX's main goal is performance. It uses a pre-allocated ring of entries. The ring is large enough, but it's bounded so that the system will not be loaded beyond design capacity. If the ring is full, the writer(s) will wait until the slowest readers advance and make room.
Entry objects are pre-allocated and live forever, to reduce garbage collection cost. We don't insert new entry objects or delete old entry objects; instead, a writer asks for a pre-existing entry, populates its fields, and notifies readers. This apparent 2-phase action is really simply an atomic action:
setNewEntry(EntryPopulator);

interface EntryPopulator {
    void populate(Entry existingEntry);
}
Pre-allocating entries also means adjacent entries (very likely) sit in adjacent memory cells, and because readers read entries sequentially, this is important for utilizing CPU caches.
And there is a lot of effort to avoid locks, CAS operations, and even memory barriers (e.g. using a non-volatile sequence variable if there's only one writer).
For developers of readers: different annotating readers should write to different fields, to avoid write contention. (Actually, they should write to different cache lines.) An annotating reader should not touch anything that other, non-dependent readers may read. This is why I say these readers annotate entries, instead of modifying entries.
Martin Fowler has written an article about LMAX and the disruptor pattern, The LMAX Architecture, which may clarify it further.
I actually took the time to study the actual source, out of sheer curiosity, and the idea behind it is quite simple. The most recent version at the time of writing this post is 3.2.1.
There is a buffer storing pre-allocated events that will hold the data for consumers to read.
The buffer is backed by an array of flags (an integer array) of the same length, which describes the availability of the buffer slots (see further for details). The array is accessed like a java.util.concurrent.atomic.AtomicIntegerArray, so for the purpose of this explanation you may as well assume it to be one.
There can be any number of producers. When a producer wants to write to the buffer, a long number is generated (as in calling AtomicLong#getAndIncrement; the Disruptor actually uses its own implementation, but it works in the same manner). Let's call this generated long the producerCallId. In a similar manner, a consumerCallId is generated when a consumer ENDS reading a slot from the buffer. The most recent consumerCallId is accessed.
(If there are many consumers, the call with the lowest id is chosen.)
These ids are then compared, and if the difference between the two is less than the buffer size, the producer is allowed to write.
(If the producerCallId is greater than the recent consumerCallId + bufferSize, it means that the buffer is full, and the producer is forced to busy-wait until a spot becomes available.)
The producer is then assigned a slot in the buffer based on its callId (which is producerCallId modulo bufferSize, but since the bufferSize is always a power of 2 (a limit enforced on buffer creation), the actual operation used is producerCallId & (bufferSize - 1)). It is then free to modify the event in that slot.
(The actual algorithm is a bit more complicated, involving caching the recent consumerCallId in a separate atomic reference, for optimisation purposes.)
When the event has been modified, the change is "published". When publishing, the respective slot in the flag array is filled with the updated flag. The flag value is the number of the loop (producerCallId divided by bufferSize; again, since bufferSize is a power of 2, the actual operation is a right shift).
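A tiny sketch of that index and flag arithmetic (the names are illustrative, not the actual Disruptor fields):

public class SlotMath {

    public static void main(String[] args) {
        final int bufferSize = 1024;                        // always a power of two
        final int indexMask = bufferSize - 1;
        final int indexShift = Integer.numberOfTrailingZeros(bufferSize); // log2(bufferSize)

        long producerCallId = 3075;                         // example sequence number
        int slot = (int) (producerCallId & indexMask);      // same as producerCallId % bufferSize -> 3
        int flag = (int) (producerCallId >>> indexShift);   // same as producerCallId / bufferSize -> 3 (the "loop number")

        System.out.println("slot = " + slot + ", availability flag = " + flag);
    }
}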
In a similar manner there can be any number of consumers. Each time a consumer wants to access the buffer, a consumerCallId is generated (depending on how the consumers were added to the disruptor, the atomic used in id generation may be shared or separate for each of them). This consumerCallId is then compared to the most recent producerCallId, and if it is the lesser of the two, the reader is allowed to progress.
(Similarly, if the producerCallId is equal to the consumerCallId, it means that the buffer is empty and the consumer is forced to wait. The manner of waiting is defined by a WaitStrategy during disruptor creation.)
For individual consumers (the ones with their own id generator), the next thing checked is the ability to batch consume. The slots in the buffer are examined in order, from the one corresponding to the consumerCallId (the index is determined in the same manner as for producers) to the one corresponding to the recent producerCallId.
They are examined in a loop by comparing the flag value written in the flag array against a flag value generated for the consumerCallId. If the flags match, it means that the producers filling the slots have committed their changes. If not, the loop is broken, and the highest committed changeId is returned. The slots from the consumerCallId to the received changeId can be consumed in a batch.
If a group of consumers read together (the ones with a shared id generator), each one only takes a single callId, and only the slot for that single callId is checked and returned.
From this article:
The disruptor pattern is a batching queue backed up by a circular array (i.e. the ring buffer) filled with pre-allocated transfer objects which uses memory-barriers to synchronize producers and consumers through sequences.
Memory barriers are kind of hard to explain, and Trisha's blog has made the best attempt in my opinion with this post: http://mechanitis.blogspot.com/2011/08/dissecting-disruptor-why-its-so-fast.html
But if you don't want to dive into the low-level details, you can just know that memory barriers in Java are implemented through the volatile keyword or through java.util.concurrent.atomic.AtomicLong. The disruptor pattern sequences are AtomicLongs and are communicated back and forth among producers and consumers through memory barriers instead of locks.
I find it easier to understand a concept through code, so the code below is a simple hello world from CoralQueue, which is a disruptor pattern implementation done by CoralBlocks, with which I am affiliated. In the code below you can see how the disruptor pattern implements batching and how the ring buffer (i.e. circular array) allows for garbage-free communication between two threads:
package com.coralblocks.coralqueue.sample.queue;

import com.coralblocks.coralqueue.AtomicQueue;
import com.coralblocks.coralqueue.Queue;
import com.coralblocks.coralqueue.util.MutableLong;

public class Sample {

    public static void main(String[] args) throws InterruptedException {

        final Queue<MutableLong> queue = new AtomicQueue<MutableLong>(1024, MutableLong.class);

        Thread consumer = new Thread() {

            @Override
            public void run() {
                boolean running = true;
                while (running) {
                    long avail;
                    while ((avail = queue.availableToPoll()) == 0); // busy spin
                    for (int i = 0; i < avail; i++) {
                        MutableLong ml = queue.poll();
                        if (ml.get() == -1) {
                            running = false;
                        } else {
                            System.out.println(ml.get());
                        }
                    }
                    queue.donePolling();
                }
            }
        };

        consumer.start();

        MutableLong ml;

        for (int i = 0; i < 10; i++) {
            while ((ml = queue.nextToDispatch()) == null); // busy spin
            ml.set(System.nanoTime());
            queue.flush();
        }

        // send a message to stop the consumer...
        while ((ml = queue.nextToDispatch()) == null); // busy spin
        ml.set(-1);
        queue.flush();

        consumer.join(); // wait for the consumer thread to die...
    }
}
}