I'm doing academic research, trying to develop a programming tool that assists in implementing fine-grained locking for concurrent programs that maintain tree-like data structures.
For example, the programmer may write some functions that receive a tree root node and modify the tree (by traversing some routes and adding/removing nodes), and the tool will help them find where in the code nodes should be locked and where they can be released, so that the functions can be executed concurrently on the same tree.
I am looking for some real-life code examples in which such fine-grained locking is used, or where it could be used for better performance but the programmer was too lazy to implement it (for example, locking the whole tree for the duration of the function call instead of releasing nodes that are no longer needed).
I read about JCR & Jackrabbit, which use a tree-shaped database, and found an article that explains how to lock nodes in JCR (but without examples):
http://www.day.com/specs/jcr/2.0/17_Locking.html
I have very little background in databases, and I don't fully understand what is and isn't allowed when it comes to Jackrabbit databases and concurrency. Accessing the same node from 2 threads is not allowed, but what about different repositories? And what happens if 2 different clients try to access the same node (for example, one tries to delete it while the other tries to modify it - will session.save() fail?).
Thanks,
Oren
First of all, don't get confused between databases/jackrabbit/locking. Jackrabbit implements its own locking, as do databases.
Jackrabbit allows you to lock nodes using LockManager.lock(). Setting the isDeep parameter to true means all the nodes underneath will also be locked. A locked node can be read by another session but can't be modified.
Technically speaking, 2 threads COULD edit the same node if they are using the same session but this is rather hazardous and should probably be avoided.
If a node is likely to be modified by 2 concurrent sessions then you should always lock the node. Whichever session gets there last should wait for the lock to be released. If you don't lock, then at least one of the sessions will throw an exception.
I'm not sure what you mean by accessing nodes from different repositories. A node can only belong to one repository. If you mean having 2 jackrabbit instances accessing the same database then this too should be avoided or you should look to using clustering.
How you implement locking is going to depend on your design and requirements. There's no point locking if you'll only ever have one session, and vice versa. Whether you lock a node or a sub-tree is going to depend on what your data represents. For example, if a node represents a folder you'll probably want to lock just that node and not the whole sub-tree. If a sub-tree represents a complex document then you probably will want to lock the sub-tree.
As for locking the whole tree, I hope I don't meet someone who does that!
Related
Having read the following statement from the official documentation of OrientDB:
In order to guarantee atomicity and consistency, OrientDB acquire an exclusive lock on the storage during transaction commit.
I am wondering if my understanding of the situation is correct. Here is how I assume this will work:
1. Thread 1 opens a transaction and reads records #1:100 to #1:200, some from class A and some from class B (something which cannot be determined before the transaction comes to a close).
2. Thread 1 massages the data, maybe even inserting a few records.
3. Thread 1 starts to commit the data. As the database does not have any way to know which parts of the data might be affected by the open transaction, it blindly locks the whole storage unit and verifies the #version fields to enforce optimistic locking on all possibly affected records.
4. Thread 2 tries to read record #1:1 (or any other record in the whole database) and is blocked by the commit process, which, AFAIK, amounts to an exclusive lock on the storage unit. This block occurs, if I'm not off, regardless of the cluster the original data resides on, since we have multi-master datasets.
5. Thread 1 ends the commit process and the database becomes consistent, effectively lifting the lock.
6. At this point, any thread can operate on the dataset, transactionally or otherwise, and will not be bound by the exclusive locking mechanism.
If this is the case, then during the exchange highlighted in point 3 the data store, in its entirety, is in an effective trance state and cannot be reached, read from, or interacted with in any meaningful way.
I do so hope that my guess is off the mark here.
Disclaimer: I have not had the chance to dig into the underlying code from the rather rich OrientDB codebase. As such, this is, at its best, an educated guess and should not be taken as any sort of reference as to how OrientDB actually operates.
Possible Workarounds:
Should worst come to worst and this happens to be the way OrientDB actually works, I would dearly welcome any workarounds to this conundrum. We are looking for meaningful ways to keep OrientDB a viable option for an enterprise, scalable, high-end application.
In the current release of OrientDB, transactions lock the storage in exclusive mode. Fortunately OrientDB works in an optimistic way, so this is done "only" at commit() time; it does not matter when the transaction was begun.
If this is a showstopper for your use case, you could consider the following:
Don't use transactions. In this case you'll run in parallel with no locks, but keep in mind that using indexes requires a lock at the index level. If an index is the bottleneck, the most common workaround is to create X sub-classes with an index on each: OrientDB will use the sub-class indexes when needed, and on a CRUD operation only that specific index will be locked.
Wait for OrientDB 3.0, where this limitation will be removed and transactions will execute in parallel for real.
I am fairly new to the concept of multi-threading and was looking at some interesting problems in order to get a better idea of it.
One of my friends suggested the following:
"It is fairly straight-forward to have a linked-list and do the regular insert, search and delete operations. But how would you do these operations if multiple threads need to work on the same list.
What is the minimum number of locks required? And how many locks could we use to make the linked-list operations as fast as possible?"
After giving it some thought, I feel that a single lock should be sufficient. We would acquire the lock for every single read and write operation: when we are accessing a node's data in the list we acquire the lock, and when we are inserting/deleting elements we acquire the lock for the complete series of steps.
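Roughly what I have in mind, as a minimal sketch (the names are only illustrative):

```cpp
#include <mutex>

// Hypothetical singly linked list guarded by one global lock.
struct Node {
    int   value;
    Node* next;
};

class CoarseList {
public:
    // Insert at the head; the single mutex serializes all writers and readers.
    void push_front(int v) {
        std::lock_guard<std::mutex> guard(m_);
        head_ = new Node{v, head_};
    }

    // Search; even read-only traversal takes the same lock.
    bool contains(int v) const {
        std::lock_guard<std::mutex> guard(m_);
        for (Node* n = head_; n != nullptr; n = n->next)
            if (n->value == v) return true;
        return false;
    }

private:
    mutable std::mutex m_;
    Node* head_ = nullptr;   // cleanup omitted for brevity
};
```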
But I was not able to think of a way where using more locks will give us more optimized performance.
Any help/pointers?
The logical extension of "one lock per list" would be "one lock per item".
This would be useful if, for example, you're often only modifying a single item of the list.
For deletion and insertion, acquiring the proper locks gets more complicated, though. You'd have to acquire the locks for the items before and after, and you'd have to make sure to always acquire them in the same order (to prevent deadlocks). There are of course also special cases to be considered if the root element has to be modified (and possibly also if it's a doubly-linked list or a circular linked list). The overhead resulting from the more complicated locking logic might make your implementation slower again, especially if you often have to insert into and delete from the list.
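A sketch of this per-node, lock-coupling ("hand-over-hand") idea for deletion, assuming a singly linked list with a mutex in every node and a dummy head node so the empty-list and delete-first cases need no special handling (all names are illustrative, not from any particular library):

```cpp
#include <mutex>

struct Node {
    int        value;
    Node*      next;
    std::mutex lock;
};

// Remove the first node containing 'v' using lock coupling: locks are always
// acquired in list order (predecessor before successor), which prevents deadlock.
bool remove(Node* head, int v) {        // 'head' is the dummy sentinel
    head->lock.lock();
    Node* prev = head;
    Node* curr = prev->next;
    while (curr != nullptr) {
        curr->lock.lock();              // hold prev AND curr
        if (curr->value == v) {
            prev->next = curr->next;    // unlink while both locks are held
            curr->lock.unlock();
            prev->lock.unlock();
            delete curr;                // safe only because every traversal
            return true;                // uses this same coupling discipline
        }
        prev->lock.unlock();            // hand over: release prev, keep curr
        prev = curr;
        curr = curr->next;
    }
    prev->lock.unlock();
    return false;
}
```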
So I would only consider this if the majority of accesses is the modification of a single node.
If you're searching for peak performance for a specific use case, then in the end, it boils down to implementing both, and running performance comparisons for a typical scenario.
You definitely need at least one semaphore/lock to ensure list integrity.
But, presumably any operation on the list changes at most two nodes: the node being inserted/changed/deleted and the adjacent node which points to it. So you could implement locking on a per-node basis, locking at most two nodes for a given operation. This would allow for a degree of concurrency when different threads access the list, though you'd need to distinguish between read and write locks to get the full benefit of this approach, I think.
If you're new to multi-threading, embrace the notion that premature optimization is a waste of time. A linked list is a very straightforward data structure, and you can make it thread-safe by putting a critical section around all reads and writes. A critical section serializes the read/insert/delete operations and ensures thread-safety, without the overhead of a full mutex or a more complicated locking mechanism.
If you want to optimize after the fact, only do so with a valid profiling tool that gives you raw numbers. The linked list operations will never end up being the biggest source of application slowdown, and it will probably never be worth your while to add in the node-level locking being discussed.
Using one lock for the entire list would completely defeat most reasons for multithreading in the first place. By locking the entire list down, you guarantee that only one thread can use the list at a time.
This is certainly safe in the sense that you will have no deadlocks or races, but it is naive and inefficient because you serialize access to the entire list.
A better approach would be to have a lock for each item in the list, and another one for the list itself. The latter would be needed when appending to the list, depending on how the list is implemented (e.g., if it maintains a node count separate from the nodes themselves).
However this might also be less than optimal depending on a number of factors. For instance, on some platforms mutexes might be expensive in terms of resources and time when instantiating the mutex. If space is at a premium, another approach might be to have a fixed-size pool of mutexes from which you draw whenever you need to access an item. These mutexes would have some kind of ownership flag which indicates which node they are allocated to, so that no other mutex would be allocated to that node at the same time.
Another technique is to use reader/writer locks, which allow read access to any number of threads but write access to only one, the two being mutually exclusive. However, it has been suggested in the literature that in many cases using a reader/writer lock is actually less efficient than simply using a plain mutex. This will depend on your actual usage pattern and how the lock is implemented.
You only need to lock when you're writing and you say there's usually only one write, so try read/write locks.
I have a directed graph data structure used for audio signal processing (see http://audulus.com if you're curious).
I would like graph edges to be strong references, so in the absence of cycles, std::shared_ptr would do the trick. Alas, there are potentially cycles in the graph.
So, I had this idea for a simple concurrent mark-sweep collector:
The mutator thread sends events to the collector thread. The collector thread maintains its own representation of the graph and does not traverse the mutator thread's graph. The collector thread just uses mark-sweep at regular intervals.
The events would be the following (in function-call form; a rough sketch follows the list):
AddRoot(Node*)
RemoveRoot(Node*)
AddEdge(Node*, Node*)
RemoveEdge(Node*, Node*)
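Not the actual collector code, just a rough sketch of what that event protocol might look like on the mutator side (the released implementation uses boost::lockfree::queue; everything here is illustrative):

```cpp
#include <mutex>
#include <queue>

struct Node;  // the mutator's node type; the collector treats it as opaque

// One event per mutation of the root set or edge set, in the order the
// mutator performed them.
struct GcEvent {
    enum Kind { AddRoot, RemoveRoot, AddEdge, RemoveEdge } kind;
    Node* a;   // the root, or the edge's source
    Node* b;   // the edge's target (null for root events)
};

// Mutator pushes, collector drains. (The real code uses a lock-free queue;
// a mutex keeps this sketch short.)
class EventQueue {
public:
    void push(const GcEvent& e) {
        std::lock_guard<std::mutex> g(m_);
        q_.push(e);
    }
    bool pop(GcEvent& out) {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::queue<GcEvent> q_;
};
// The collector thread periodically drains the queue, applies the events to
// its private copy of the graph, then runs mark-sweep over that copy.
```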
Is this scheme correct? The collector thread has an older version of what the mutator thread sees. My intuition is that since a node that is unreachable at an earlier time will still be unreachable at a later time, the collector thread may delete an unreachable object as soon as it finds one.
Also, if it's correct for one mutator thread, would it work for multiple mutator threads?
UPDATE
I've released the code here: https://github.com/audulus/collector. The code is actually fairly general purpose. Use RootPtr<T> to automatically keep track of root nodes. Links between nodes are managed using EdgePtr<T>.
The collector seems to work for multiple mutator threads (both in my app and in unit tests), but I feel like a proof of correctness is needed.
PLEASE NOTE (in response to #AaronGolden's comment below; judging from the comments, people aren't reading this): The mutator thread is responsible for calling the collector functions in the correct order. For example, if the mutator thread calls RemoveEdge(a,b) before assigning b to a RootPtr, the collector may intervene and collect b.
UPDATE 2:
I've updated the code to my latest version and updated the link above. I've now used the code in my app for over a year and haven't attributed any bugs to it.
One argument I think is somewhat persuasive (though I would hesitate to call it proof) that this scheme works is that in the absence of cycles, the scheme is equivalent to reference counting with atomic reference counts.
In the absence of cycles, AddRoot and AddEdge map to incrementing a reference count and RemoveRoot and RemoveEdge map to decrementing. Pushing an event onto the queue (I use boost::lockfree::queue) is an atomic operation just like the updating reference counts.
So then the remaining question is: how do cycles change the picture in terms of correctness? To wave hands a bit, cycles are a property of the connectivity of the graph, but don't have an effect on the atomicity of the operations or the ability of one thread to know something earlier than it would otherwise (causing a potential bad ordering of operations).
This would suggest that if there's a counterexample for the scheme, it will involve playing some game with cycles.
Is this scheme correct?
I'm concerned that you don't have any concept of safe points. In particular, can an update require more than one of your actions to be executed atomically? Perhaps it is ok because you can always add all vertices and edges in a batch before removing.
Also, if it's correct for one mutator thread, would it work for multiple mutator threads?
If one thread drops a root to a subgraph just after another picks up a root to the same subgraph then you must make sure you get the messages in-order which means you cannot use per-mutator queues. And a global queue is likely to kill scalability.
One of my constraints is that the GC has to be lock-free because of my real-time DSP thread
Is the allocator lock-free?
What if the GC cannot keep up with mutator(s)?
Also, I would recommend considering:
Forking and collecting in a child process.
Incremental mark-sweep with Dijkstra's tricolor marking scheme.
Baker's treadmill.
VCGC.
Separate per-thread heaps with deep copying of messages.
Given: a complex structure of various nested collections, with refs scattered in different levels.
Need: A way to take a snapshot of such a structure, while allowing writes to continue to happen in other threads.
So one the "reader" thread needs to read whole complex state in a single long transaction. The "writer" thread meanwhile makes modifications in multiple short transactions. As far as I understand, in such a case STM engine utilizes the refs history.
Here we have some interesting results. E.g., reader reaches some ref in 10 secs after beginning of transaction. Writer modifies this ref each 1 sec. It results in 10 values of ref's history. If it exceeds the ref's :max-history limit, the reader transaction will be run forever. If it exceeds :min-history, transaction may be rerun several times.
But really the reader needs just a single value of ref (the 1st one) and the writer needs just the recent one. All intermediate values in history list are useless. Is there a way to avoid such history overuse?
Thanks.
To me it's a bit of a "design smell" to have a large structure with lots of nested refs. You are effectively emulating a mutable object graph, which is a bad idea if you believe Rich Hickey's take on concurrency.
Some thoughts to try out:
The idiomatic way to solve this problem in Clojure would be to put the state in a single top-level ref, with everything inside it being immutable. Then the reader can take a snapshot of the entire concurrent state for free (without even needing a transaction); a rough sketch of this idea follows the list. It might be difficult to refactor to this from where you currently are, but I'd say it is best practice.
If you only want the reader to get a snapshot of the top level ref, you can just deref it directly outside of a transaction. Just be aware that the refs inside may continue to get mutated, so whether this is useful or not depends on the consistency requirements you have for the reader.
You can do everything within a (dosync...) transaction as normal for both readers and writer. You may get contention and transaction retries, but it may not be an issue.
You can create a "snapshot" function that quickly traverses the graph and dereferences all the refs within a transaction, returning the result with the refs stripped out (or replaced by new cloned refs). The reader calls snapshot once, then continues to do the rest of it's work after the snapshot is completed.
You could take a snapshot immediately each time after the writer finishes, and store it separately in an atom. Readers can use this directly (i.e. only the writer thread accesses the live data graph directly)
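Not Clojure, but to make the first and last points concrete, here is a rough analogue in C++ (purely illustrative): the writer builds a fresh immutable state and publishes it by swapping one shared pointer, so readers grab whatever pointer is current and get a consistent snapshot for free.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

// The whole state is one immutable value, like a single top-level ref/atom.
using State = std::map<std::string, int>;

class Snapshotted {
public:
    // Reader: grab the current snapshot; it can never change underneath us.
    std::shared_ptr<const State> snapshot() const {
        std::lock_guard<std::mutex> g(m_);
        return current_;
    }

    // Writer: copy, modify, and publish a new snapshot in one step.
    void update(const std::string& key, int value) {
        std::lock_guard<std::mutex> g(m_);
        auto next = std::make_shared<State>(*current_);
        (*next)[key] = value;
        current_ = std::move(next);
    }

private:
    mutable std::mutex m_;
    std::shared_ptr<const State> current_ = std::make_shared<State>();
};
```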
The general answer to your question is that you need two things:
A flag to indicate that the system is in "snapshot write" mode
A queue to hold all transactions that occur while the system is in snapshot mode
As far as what to do if the queue overflows because the snapshot process isn't fast enough: well, there isn't much you can do about that except either optimize that process, or increase the size of your queue - it's going to be a balance that you'll have to strike depending on the needs of your app. It's a delicate balance, and is going to take some pretty extensive testing, depending on how complex your system is.
But you're on the right track. If you basically put the system in "snapshot write mode", then your reader/writer methods should automatically change where they are reading/writing from, so that the thread that is making changes gets all the "current values" and the thread reading the snapshot state is reading all the "snapshot values". You can split these up into separate methods - the snapshot reader will use the "snapshot value" methods, and all other threads will read the "current value" methods.
When the snapshot reader is done with its work, it needs to clear the snapshot state.
If a thread tries to read the "snapshot values" when no "snapshot state" is currently set, they should simply respond with the "current values" instead. No biggie.
Systems that allow snapshots of file systems to be taken for backup purposes, while not preventing new data from being written, follow a similar scheme.
Finally, unless you need to keep a record of all changes to the system (i.e. for an audit trail), then the queue of transactions actually doesn't need to be a queue of changes to be applied - it just needs to store the latest value of whatever thing you're changing in the system. When the "snapshot state" is cleared, you simply write all those non-committed values to the system, and call it done. The thing you might want to consider is making a log of those changes yet to be made, in case you need to recover from a crash, and have those changes still applied. The log file will give you a record of what happened, and can let you do this recovery. That's an oversimplification of the recovery process, but that's not really what your question is about, so I'll stop there.
What you are after is the state-of-the-art in high-performance concurrency. You should look at the work of Nathan Bronson, and his lab's collaborations with Aleksandar Prokopec, Phil Bagwell and the Scala team.
Binary Tree:
http://ppl.stanford.edu/papers/ppopp207-bronson.pdf
https://github.com/nbronson/snaptree/
Tree-of-arrays-based Hash Map
http://lampwww.epfl.ch/~prokopec/ctries-snapshot.pdf
However, a quick look at the implementations above should convince you this is not "roll-your-own" territory. I'd try to adapt an off-the-shelf concurrent data structure to your needs if possible. Everything I've linked to is freely available on the JVM, but it's not native Clojure as such.
First a little explanation of what I am trying to do:
My plan is to write a program with a socket stream implemented using the boost::asio library, which feeds data to a parser implemented using boost::spirit::qi. The parser will take the packets and fill a packet object, then append that object to the end of a linked list of packet objects. The packet processor will read the first object in the list, do its processing, then move on to the next item and delete the first.
I decided to use a linked list because if I used a std::queue I would have to lock the entire container every time the stream added a packet or the processor removed one, which would make the two threads run more or less serially, which I would like to avoid. Plus the queue class has a tendency to copy entire objects, whereas the linked-list idea has the benefit of creating the object once and then just pointing to it. To avoid serializing this whole business I intend to place boost::mutex mutexes in each node and lock them from there.
The idea is to have the socket stream create the list and immediately lock the first node, populate the node from the parser, create the next node and lock it, then unlock the first node and move on to the next node to do work. This way there's never an unlocked node dangling at the end that the packet processor may jump to and delete under the socket stream's nose. The packet processor will check the first node and try to lock it; if it locks it, it will do its processing, then unlock it, get the next node, and then delete the first node. This way serialization is limited to those times when the packet processor has caught up to the socket stream class.
So now my question is, before I do the work of actually implementing this: does this sound like a good idea? I've tried it on a trivial test and it seems to work all right, and I can't think of any serious issues with it as long as I implement exception handling and take care to free any memory I allocate. But if anyone can think of any problems with this idea that I've overlooked, I would appreciate the input. I would also appreciate any other suggestions, either as an alternative or to make this idea work better.
Thanks in advance!
Check out this article; it's about multiple consumers, but still brilliant:
Measuring Parallel Performance: Optimizing a Concurrent Queue
This implementation is screaming three things at me:
Way too easy to get deadlock, because insert and delete will need to lock multiple mutexes simultaneously, and you can't acquire multiple mutexes atomically. Well, you can, but you would need to put mutexes around your mutexes.
Way too easy to get corruption. The problems that lead to deadlock can also lead to corruption.
And slow, slow, slow. Think of your poor list walker: each step involves an unlock, a lock, another unlock, and another lock. It is going to have to be very careful in stepping, and man, will it be expensive. Lock and unlock each item, and do so in the correct order? Ouch.
This looks like a case of premature optimization and premature pessimization operating in parallel. What drove this architecture?
I suggest starting with the simple solution first. Lock the whole thing each time whenever you want to touch any part of it. See if that gets you in trouble. If it doesn't, problem solved. If it does, a next step up is to switch from mutexes to read-write locks. Just one, not a gazillion. Now you have to think a bit about whether you want a shared or exclusive lock. If you have been diligent about const-correctness, try using a shared lock (read lock) for your const methods, an exclusive lock (write lock) for your non-const methods.
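Not your actual class, but a sketch of what that single read-write lock could look like with C++17's std::shared_mutex (boost::shared_mutex is the pre-C++17 equivalent); names are illustrative. Const methods take a shared (read) lock, mutating methods take an exclusive (write) lock:

```cpp
#include <list>
#include <shared_mutex>

class GuardedList {
public:
    // const method: many readers may hold the shared lock at once.
    bool contains(int v) const {
        std::shared_lock<std::shared_mutex> lock(m_);
        for (int x : items_)
            if (x == v) return true;
        return false;
    }

    // non-const method: one writer, excludes all readers and other writers.
    void push_front(int v) {
        std::unique_lock<std::shared_mutex> lock(m_);
        items_.push_front(v);
    }

private:
    mutable std::shared_mutex m_;
    std::list<int> items_;
};
```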
I don't think what you suggest will work. Remember that when you delete a node from a linked list, you need to update the other nodes that point to the deleted node. Similarly, when you add a node you also need to update other nodes to point at the new node. So just locking the node being deleted or added isn't enough.
There are lock-free queues, but they are fiendishly difficult to get correct. For instance, look at the initial comments to the article here describing the extra work required to get a published algorithm to work.
Even though you are calling this a linked list, this is, in effect, a queue.
It is possible to implement a Single Producer Single Consumer lock-free queue if you are willing to use a fixed-size buffer. This lets you control the memory usage, at the cost of making the Producer wait if the Consumer is not fast enough.
Apart from this slight point, your design looks fine; it will probably be easier to get it right than the lock-free alternative.
Do remember to have a termination condition (a null pointer in the next field, for example) so that the Producer can signal to the Consumer that there is nothing more to process.
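A minimal sketch of such a single-producer single-consumer fixed-size queue (purely illustrative, not the code under discussion): because only the producer writes the tail index and only the consumer writes the head index, a bounded ring plus two atomic counters is enough.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, std::size_t Capacity>
class SpscQueue {
public:
    // Producer thread only. Returns false when the buffer is full.
    bool push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buffer_[tail] = item;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns false when the buffer is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buffer_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity];
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```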
Hmm.. why such a complex solution to a common problem? There are plenty of producer-consumer queue classes out there - pick a reference/pointer/int-sized one that works (i.e. no data copying).
Get bytes from your stream until you have assembled a 'packet object' according to your protocol. Shove the packet object reference onto the queue and immediately new() another one for the next packet. The packet processor consumes packet objects from the queue.
The queue is only locked for the time taken to push/pop an object reference to/from the queue. The packet object being assembled by the socket/stream callback and the packet object being processed by the packet processor are always different, so no locking required.
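A sketch of that kind of pointer-sized producer-consumer queue (illustrative only; the boost thread primitives of the day would work the same way): the lock is held just long enough to push or pop one pointer, never while a packet is being assembled or processed.

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>

struct Packet;  // assembled by the stream thread, consumed by the processor

class PacketQueue {
public:
    void push(std::unique_ptr<Packet> p) {
        {
            std::lock_guard<std::mutex> g(m_);
            q_.push(std::move(p));
        }
        cv_.notify_one();
    }

    // Blocks until a packet is available; a null packet can signal shutdown.
    std::unique_ptr<Packet> pop() {
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return !q_.empty(); });
        auto p = std::move(q_.front());
        q_.pop();
        return p;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::unique_ptr<Packet>> q_;
};
```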
Trying to operate on objects that are already in the queue, (or queue-like linked-list), sounds like a nightmare to me, (and the other posters seem to agree). Is there some other reason why you need to do this?
Rgds,
Martin