STM and alter in Clojure

I am working through the Programming Clojure book. While explaining alter and the STM, they say that if, during an alter, Clojure detects a change to the ref from outside the transaction, it will re-run the transaction with the new value. If that is the case, I would imagine the update function you pass in needs to be pure, but that isn't indicated in the docs (and it is in other similar situations).
So is my assumption correct? If not, how does the STM re-apply the function? If it is correct, is it the case that you can't rely on the docs to tell you when you can have side effects, and when you can't?

It doesn't strictly have to be pure; it just has to be idempotent. In practice this is basically the same thing.
Further, it only has to be idempotent when seen outside of the STM: if the only side effect you produce is writing to some other ref or (I think) sending to an agent, that operation will be held until your transaction has succeeded.
It's also not really the case that it has to be any of these things: just that, if your update function isn't pure, the results may not be what you expect.
Edit: dosync's docs tell you that any expressions in the body may be executed more than once. You can't run an alter without running a dosync, so it looks like all the docs you need are there. What would you like changed?
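Here's a quick way to see that retry behaviour for yourself (just a sketch; the names are throwaway). The swap! on the plain atom is a side effect the STM cannot roll back, so it can fire more times than the ref is actually incremented:
;; Sketch: a plain atom counts how many times the dosync body actually runs.
(def r (ref 0))
(def body-runs (atom 0))

(defn slow-inc! []
  (dosync
    (swap! body-runs inc)     ; not transactional: survives a retry
    (Thread/sleep 100)        ; widen the window for a conflict
    (alter r inc)))

(let [f1 (future (slow-inc!))
      f2 (future (slow-inc!))]
  @f1 @f2
  [@r @body-runs])   ; @r is always 2, but @body-runs may be 3 or more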

Just as a side note:
If you need to perform side-effects like logging in your STM transaction you can send messages to agents to do the non-idempotent parts. Messages sent to agents are dispatched only when the transaction finishes and are guaranteed to only be sent once.
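For example (a minimal sketch with made-up names), the send below is buffered by the STM and dispatched only after the transaction commits, so the log line appears exactly once even if the body retries:
;; Sketch: logging from inside a transaction via an agent.
(def logger (agent nil))
(def counter (ref 0))

(defn log [_state msg]
  (println msg))              ; the non-idempotent side effect lives here

(dosync
  (let [new-val (alter counter inc)]
    ;; this send is held by the STM and dispatched only once, after commit
    (send logger log (str "counter is now " new-val))))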

The point in Clojure is that there should be no side effects when you deal with transactions, because they are consistent: the function will re-run (I prefer "retry") when it finds a conflict during the update of the shared value; otherwise it will commit the change successfully.
If it has to retry, it will read the updated value, so there is no lasting side effect. The problem you could run into is a livelock, but that is bounded by Clojure's limit on the number of retries.

Related

Event Sourcing/CQRS doubts about aggregates, atomicity, concurrency and eventual consistency

I'm studying event sourcing and command/query segregation and I have a few doubts that I hope someone with more experience will easily answer:
A) should a command handler work with more than one aggregate? (a.k.a. should they coordinate things between several aggregates?)
B) If my command handler generates more than one event to store, how do you guys push all those events atomically to the event store? (how can I guarantee no other command handler will "interleave" events in between?)
C) In many articles I read people suggest using optimistic locking to write the new events generated, but in my use case I will have around 100 requests / second. This makes me think that a lot of requests will just fail at huge rates (a lot of ConcurrencyExceptions). How do you guys deal with this?
D) How to deal with the fact that the command handler can crash after storing the events in the event store but before publishing them to the event bus? (how to eventually push those "confirmed" events back to the event bus?)
E) How do you guys deal with the eventual consistency in the projections? Do you just live with it? Or do people in some cases lock things there too (waiting for an update, for example)?
I made a sequence diagram to better illustrate all those questions
(and sorry for the bad English)
If my command handler generates more than one event to store, how do you guys push all those events atomically to the event store?
Most reasonable event store implementations will allow you to batch multiple events into the same transaction.
In many articles I read people suggest using optimistic locking to write the new events generated, but in my use case I will have around 100 requests / second.
If you have lots of parallel threads trying to maintain a complex invariant, something has gone badly wrong.
For "events" that aren't expected to establish or maintain any invariant, then you are just writing things to the end of a stream. In other words, you are probably not trying to write an event into a specific position in the stream. So you can probably use batching to reduce the number of conflicting writes, and a simple retry mechanism. In effect, you are using the same sort of "fan-in" patterns that appear when you have concurrent writers inserting into a queue.
For the cases where you are establishing/maintaining an invariant, you don't normally have many concurrent writers. Instead, specific writers have authority to write events (think "sharding"); the concurrency controls there are primarily to avoid making a mess in abnormal conditions.
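To make the batching + retry idea concrete, here's a toy sketch against an in-memory store held in an atom (this is not a real event-store API, just an illustration of appending a batch under an expected-version check):
;; Toy in-memory event store: stream-id -> vector of events.
(def store (atom {}))

(defn append-events!
  "Append a batch of events atomically, but only if the stream is still at
   expected-version. Returns true on success, false on a conflict."
  [stream-id expected-version events]
  (let [current @store
        stream  (get current stream-id [])]
    (if (= (count stream) expected-version)
      (compare-and-set! store current
                        (assoc current stream-id (into stream events)))
      false)))

;; Usage: a command handler would reload the aggregate, re-decide, and retry
;; when this returns false (a ConcurrencyException in disguise).
(append-events! "order-42" 0 [{:type :order-placed}
                              {:type :item-added}])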
How to deal with the fact that the command handler can crash after storing the events in the event store but before publishing them to the event bus?
Use pull, rather than push, as the primary subscription mechanism. Make sure that subscribers can handle duplicate messages safely (aka "idempotent"). Don't use a message subscription that can re-order events when you need events strictly ordered.
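A rough sketch of what a pull-based, duplicate-tolerant subscriber can look like (read-events and handle-event are hypothetical functions; events are assumed to carry a monotonically increasing :position):
;; Sketch of a pull loop: the subscriber owns its position.
;; handle-event must be idempotent, so a re-delivered event is harmless.
(defn run-subscriber [read-events handle-event]
  (loop [position 0]
    (let [events (read-events position)]
      (doseq [e events]
        (handle-event e))
      (if (seq events)
        (recur (inc (:position (last events))))
        ;; a real subscriber would sleep and poll again here
        position))))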
How do you guys deal with the eventual consistency in the projections? Do you just live with it?
Pretty much. Views and reports have metadata in them to let you know at what fixed point in "time" the report was accurate.
Unless you lock out all writers while a report is being consumed, there's a potential for any data being out of date, regardless of whether you are using events vs some other data model, regardless of whether you are using a single data model or several.
It's all part of the tradeoff; we accept that there will be a larger window between report time and current time in exchange for lower response latency, an "immutable" event history, etc.
should a command handler work with more than one aggregate?
Probably not, which isn't the same thing as "never".
The usual framing goes something like this: aggregate isn't a domain modeling pattern, like entity; it's a lifecycle pattern, used to make sure that all of the changes we make at one time are consistent.
In the case where you find that you want a command handler to modify multiple domain entities at the same time, and those entities belong to different aggregates, then have you really chosen the correct aggregate boundaries?
What you can do sometimes is have a single command handler that manages multiple transactions, updating a different aggregate in each. But it might be easier, in the long run, to have two different command handlers that each receive a copy of the command and decide what to do, independently.

How to get long-running transactions to fail fast in Clojure

Assuming that the ref in the following code is modified in other transactions as well as the one below,
my concern is that this transaction will run until it's time to commit, fail on commit, and then be re-run.
(defn modify-ref [my-ref]
  (dosync (if (some-prop-of-ref-true @my-ref)
            (alter my-ref long-running-calculation))))
Here's my fear in full:
modify-ref is called, a transaction is started (call it A), and long-running-calculation starts
another transaction (call it B) starts, modifies my-ref, and returns (commits successfully)
long-running-calculation continues until it is finished
transaction A tries to commit but fails because my-ref has been modified
the transaction is restarted (call it A') with the new value of my-ref and exits because some-prop is not true
Here's what I would like to happen, and perhaps this is what happens (I just don't know, so I'm asking the question :-)
When the transaction B commits my-ref, I'd like transaction A to immediately stop (because the value of my-ref has changed) and restart with the new value. Is that what happens?
The reason I want this behavior is so that long-running-calculation doesn't waste all that CPU time on a calculation that is now obsolete.
I thought about using ensure, but I'm not sure how to use it in this context or if it is necessary.
It works as you fear.
Stopping a thread in the JVM doing whatever it is doing requires a collaborative effort so there is no generic way for Clojure (or any other JVM language) to stop a running computation. The computation must periodically check a signal to see if it should stop itself. See How do you kill a thread in Java?.
About how to implement it: I would say that it is just too hard, so I would first measure whether it is really an issue. If it is, I would see whether a traditional pessimistic lock is a better solution. If pessimistic locks are still not the solution, I would try to build something that runs the computation outside the transaction, uses watchers on the refs, and sets the refs conditionally after the computation if they still have the same value. Of course this runs outside the transaction boundaries and is probably a lot trickier than it sounds.
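A minimal sketch of that last idea, reusing the question's helper names: do the expensive work outside any transaction, then commit only if the ref is still unchanged (as noted above, this is trickier than it looks and is not equivalent to doing the work transactionally):
;; Sketch: compute outside the transaction, commit conditionally.
(defn modify-ref-outside [my-ref]
  (let [snapshot @my-ref]
    (when (some-prop-of-ref-true snapshot)
      (let [result (long-running-calculation snapshot)]   ; outside dosync
        (dosync
          (when (= @my-ref snapshot)     ; someone changed it? discard result
            (ref-set my-ref result)))))))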
About ensure: by default only the refs you modify participate in the transaction's conflict detection, so plain reads can suffer from write skew. See Clojure STM ambiguity factor for a longer explanation.
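For reference, a minimal sketch of ensure: it makes a ref that you only read participate in the transaction, which is what guards against write skew:
(def checked-ref (ref 0))
(def written-ref (ref 0))

(dosync
  (when (zero? (ensure checked-ref))  ; others can't commit a change to
    (alter written-ref inc)))         ; checked-ref before this tx commits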
This doesn't happen, because...well, how could it? Your function long-running-calculation doesn't have any code in it to handle stopping prematurely, and that's the code that's being run at the time you want to cancel the transaction. So the only way to stop it would be to preemptively stop the thread from executing and forcibly restart it at some other location. This is terribly dangerous, as java.lang.Thread/stop discovered back in Java 1.1; the side effects could be a lot worse than some wasted CPU cycles.
refs do attempt to solve this problem, sorta: if there's one long-running transaction that has to restart itself many times because shorter transactions keep sneaking in, it will take a stronger lock and run to completion. But this is a pretty rare occurrence (heck, even needing to use refs is rare, and this is a rare way for refs to behave).

Taking a snapshot of a complex mutable structure in a concurrent environment

Given: a complex structure of various nested collections, with refs scattered in different levels.
Need: A way to take a snapshot of such a structure, while allowing writes to continue to happen in other threads.
So the "reader" thread needs to read the whole complex state in a single long transaction. The "writer" thread meanwhile makes modifications in multiple short transactions. As far as I understand, in such a case the STM engine utilizes the refs' history.
Here we get some interesting results. E.g., the reader reaches some ref 10 seconds after the beginning of its transaction, while the writer modifies this ref every second. That results in 10 values in the ref's history. If that exceeds the ref's :max-history limit, the reader transaction will be rerun forever; if it only exceeds :min-history, the transaction may be rerun several times.
But really the reader needs just a single value of the ref (the first one) and the writer needs just the most recent one. All intermediate values in the history list are useless. Is there a way to avoid such history overuse?
Thanks.
To me it's a bit of a "design smell" to have a large structure with lots of nested refs. You are effectively emulating a mutable object graph, which is a bad idea if you believe Rich Hickey's take on concurrency.
Some thoughts to try out:
The idiomatic way to solve this problem in Clojure would be to put the state in a single top-level ref, with everything inside it being immutable. Then the reader can take a snapshot of the entire concurrent state for free (without even needing a transaction). Might be difficult to refactor to this from where you currently are, but I'd say it is best practice.
If you only want the reader to get a snapshot of the top-level ref, you can just deref it directly outside of a transaction. Just be aware that the refs inside may continue to get mutated, so whether this is useful or not depends on the consistency requirements you have for the reader.
You can do everything within a (dosync...) transaction as normal for both readers and writer. You may get contention and transaction retries, but it may not be an issue.
You can create a "snapshot" function that quickly traverses the graph and dereferences all the refs within a transaction, returning the result with the refs stripped out (or replaced by new cloned refs). The reader calls snapshot once, then continues to do the rest of its work after the snapshot is completed (see the sketch after this list).
You could take a snapshot immediately each time after the writer finishes, and store it separately in an atom. Readers can use this directly (i.e. only the writer thread accesses the live data graph directly)
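As a sketch of the snapshot-function option above, assuming the refs appear as values inside ordinary Clojure collections (maps, vectors, etc.) that clojure.walk can traverse:
(require '[clojure.walk :as walk])

;; Sketch: within a single transaction, replace every ref in the nested
;; structure with its value. Reads inside one transaction all see the same
;; point in time, so the result is a consistent snapshot.
(defn snapshot [state]
  (dosync
    (walk/prewalk
      (fn [x]
        (if (instance? clojure.lang.Ref x)
          (deref x)
          x))
      state)))

;; The reader calls (snapshot big-structure) once, then works on plain data.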
The general answer to your question is that you need two things:
A flag to indicate that the system is in "snapshot write" mode
A queue to hold all transactions that occur while the system is in snapshot mode
As far as what to do if the queue overflows because the snapshot process isn't fast enough: there isn't much you can do about that except either optimize that process or increase the size of your queue. It's a delicate balance that you'll have to strike depending on the needs of your app, and it's going to take some pretty extensive testing, depending on how complex your system is.
But you're on the right track. If you basically put the system in "snapshot write mode", then your reader/writer methods should automatically change where they are reading/writing from, so that the thread that is making changes gets all the "current values" and the thread reading the snapshot state is reading all the "snapshot values". You can split these up into separate methods - the snapshot reader will use the "snapshot value" methods, and all other threads will read the "current value" methods.
When the snapshot reader is done with its work, it needs to clear the snapshot state.
If a thread tries to read the "snapshot values" when no "snapshot state" is currently set, they should simply respond with the "current values" instead. No biggie.
Systems that allow snapshots of file systems to be taken for backup purposes, while not preventing new data from being written, follow a similar scheme.
Finally, unless you need to keep a record of all changes to the system (e.g. for an audit trail), the queue of transactions doesn't actually need to be a queue of changes to be applied; it just needs to store the latest value of whatever thing you're changing in the system. When the "snapshot state" is cleared, you simply write all those non-committed values to the system and call it done. The thing you might want to consider is keeping a log of the changes yet to be made, in case you need to recover from a crash and still have those changes applied. The log file gives you a record of what happened, and lets you do this recovery. That's an oversimplification of the recovery process, but that's not really what your question is about, so I'll stop there.
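One very reduced sketch of the flag and "snapshot values vs current values" split described above (all names are made up, and the queue/log of pending changes is left out):
(def live-state (atom {}))       ; what writers and normal readers use
(def snapshot-state (atom nil))  ; nil means "not in snapshot mode"

(defn begin-snapshot! [] (reset! snapshot-state @live-state))
(defn end-snapshot!   [] (reset! snapshot-state nil))

(defn read-current  [] @live-state)
(defn write-current! [f & args] (apply swap! live-state f args))

(defn read-snapshot []
  ;; if no snapshot state is set, fall back to the current values
  (or @snapshot-state @live-state))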
What you are after is the state-of-the-art in high-performance concurrency. You should look at the work of Nathan Bronson, and his lab's collaborations with Aleksandar Prokopec, Phil Bagwell and the Scala team.
Binary Tree:
http://ppl.stanford.edu/papers/ppopp207-bronson.pdf
https://github.com/nbronson/snaptree/
Tree-of-arrays-based Hash Map:
http://lampwww.epfl.ch/~prokopec/ctries-snapshot.pdf
However, a quick look at the implementations above should convince you this is not "roll-your-own" territory. I'd try to adapt an off-the-shelf concurrent data structure to your needs if possible. Everything I've linked to is freely available on the JVM, but it's not native Clojure as such.

Testing concurrent data structures

What are some methods for testing concurrent data structures to make sure they behave correctly when accessed from multiple threads?
All of the other answers have focused on actually testing the code by putting it through its paces and actually running it in one form or another or politely saying "don't do it yourself, use an existing library".
This is great and all, but IMO the most important test (practical tests are important too) is to look at the code line by line and, for every line of code, ask "what happens if I get interrupted by another thread here?" Imagine another thread running just about any of the other lines/functions during this interruption. Do things still stay consistent? When competing for resources, do the other thread(s) block or spin?
This is what we did in school when learning about concurrency and it is a surprisingly effective approach. Bottom line, I feel that taking the time to prove to yourself that things are consistent and work as expected in all states is the first technique you should use when dealing with this stuff.
Concurrent systems are probabilistic and errors are often difficult to replicate. Therefore you need to run various input/output cases, each tested over time (hours, days, etc) in order to detect possible errors.
Tests for a concurrent data structure involve examining the container's state before and after expected events such as inserts and deletes.
Use a pre-existing, pre-tested library that meets your needs if possible.
Make sure that the code has appropriate self-consistency checks (preferably fast sanity checks), and run your code on as many different types of hardware as possible to help narrow down interesting timing problems.
Have multiple people peer review the code, preferably without a pre-explanation of how it's supposed to work. That way they have to grok the code which should help catch more bugs.
Set up a bunch of threads that do nothing but random operations on the data structures and check for consistency at some rate.
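For instance, a throwaway sketch of that idea with refs: several futures do random transfers between two refs while another future checks, at some rate, an invariant that must always hold:
(def a (ref 500))
(def b (ref 500))

(defn random-transfer! []
  (dosync
    (let [n (rand-int 10)]
      (alter a - n)
      (alter b + n))))

(defn run-stress-test [n-threads n-ops]
  (let [checker (future
                  (dotimes [_ 100]
                    ;; invariant: the total never changes
                    (assert (= 1000 (dosync (+ @a @b))))
                    (Thread/sleep 10)))
        workers (doall (for [_ (range n-threads)]
                         (future (dotimes [_ n-ops] (random-transfer!)))))]
    (doseq [w (cons checker workers)] @w)))

(run-stress-test 8 10000)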
Start with the assumption that your calls to access/modify data are not thread safe and use locks to ensure only a single thread can access/modify any part of the data at a time. Only after you can prove to yourself that a specific type of access is safe outside of the lock by multiple threads at once should you move that code outside of the lock.
Assume worst case scenarios, e.g. that your code will stop right in the middle of some pointer manipulation or another critical point, and that another thread will encounter that data in mid-transition. If that would have a bad result, leave it within the lock.
I normally test these kinds of things by injecting sleep() calls at appropriate places in the distributed threads/processes.
For instance, to test a lock, put sleep(2) in all your threads at the point of contention, and spawn two threads roughly 1 second apart. The first one should obtain the lock, and the second should have to wait for it.
Most race conditions can be tested by extending this method, but if your system has too many components it may be difficult or impossible to know every possible condition that needs to be tested.
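In Clojure the same trick might look like this (just a sketch using locking; the 2-second sleep sits at the point of contention and the second thread starts about a second later):
(def lock-obj (Object.))

(defn critical-section [name]
  (locking lock-obj
    (println name "acquired the lock")
    (Thread/sleep 2000)              ; sleep(2) at the point of contention
    (println name "releasing the lock")))

(let [t1 (future (critical-section "thread-1"))]
  (Thread/sleep 1000)                ; spawn the second thread ~1 second later
  (let [t2 (future (critical-section "thread-2"))]
    [@t1 @t2]))
;; Expected ordering: thread-1 acquires first; thread-2 has to wait for it.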
Run your concurrent threads for one or a few days and look at what happens. (Sounds strange, but finding race conditions is such a complex topic that simply trying it is the best approach.)

What are the benefits of Clojure promises over using add-watch?

I am looking at different ways of implementing concurrency in clojure and these seem to be two competing ways of doing the same thing, so I was wondering where I should use each technique.
Watches are about one entity in a concurrent system and promises are about two entities.
Promises are more of a way to communicate between events on different timelines. They provide a way for a piece of code to receive a response without having to worry about what mechanism will be providing the answer. The original code path can create a promise and pass it to two different code paths in a single thread, or threads, or agents, or nodes in a distributed system. Then, when one of the threads/agents/refs needs an answer, it can block on the promise without having to know anything about the entity that will be fulfilling the promise. And when the other thread/agent/ref/other figures out the answer, it can fulfill the promise without having to know anything about the entity that is waiting on the promise (or not yet waiting).
Promises are a communication mechanism across timelines that is independent of the concurrency mechanism used.
Watches are a way of specifying a function to call when an atom or ref changes. This is a way of communicating intent to all the future states of a single atom/ref, by saying "Hey, make sure this condition is always true" or "log the change here".
Watches and promises are both very useful for concurrency, but are suited to slightly different uses. You may well find that you want to use both in different places in the same application.
Use a watch if you want notification of change in a reference. For example, if one thread is handling events and updates a ref in response to some of these events, you could use add-watch to enable other parts of your system to receive notification of the update. A single watch can handle many updates over time.
Use a promise if you want to pass another thread a handle to access a value that is not yet computed. If the other thread tries to dereference the promise, it will block until the computation of the promise has finished (i.e. the original thread places a value in the promise via "deliver"). A single promise is only intended to be used once; after that it is just a fixed value.
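A minimal sketch of both, side by side (names are throwaway):
;; Watch: be notified of every change to a reference over time.
(def events (atom []))
(add-watch events :printer
           (fn [_key _ref old new]
             (println "events grew from" (count old) "to" (count new))))
(swap! events conj :login)        ; the watch fn runs on each update

;; Promise: a one-shot handle to a value another thread will compute.
(def answer (promise))
(future (println "the answer is" @answer))  ; blocks here until delivered
(deliver answer 42)                         ; unblocks the reader, exactly once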