Should agents only hold immutable values - concurrency

In Clojure Programming (O'Reilly) there is an example where a java.io.BufferedWriter and a java.io.PrintWriter are each put inside an agent (one agent for each). These are then written to inside agent actions. The book says that it is safe to perform I/O in an agent action. As I understand it, all side-effecting operations are OK inside an agent action. This is because agent actions dispatched inside a transaction are only run if the transaction commits successfully, and agent actions dispatched inside other agent actions are only run after the outer agent action completes successfully. Agent actions in general are guaranteed to be applied serially.
The Clojure documentation says this: "The state of an Agent should be itself immutable...".
As I understand it, the reason that atoms and refs must hold immutable values is so that Clojure can roll back and retry transactions several times.
What I don't understand is:
1: If Clojure makes sure that agent actions are only run once, why must agent values be immutable?
(For example, if I hold a Java array in an agent and add to it in an agent action, this should be fine because the action will only run once. This is very similar to adding lines to a BufferedWriter.)
2: Is java.io.BufferedWriter considered immutable? I understand that you could have a stable reference to one, but if the agent action is performing I/O on it, should it still be considered immutable?
3: If BufferedWriter is considered immutable, how do I decide if other similar java classes are immutable?

As I see it:
Values held by agents should be 'effectively immutable' (term borrowed from JCIP), in that they should always be conceptually equal to themselves.
This means that if I .clone() an object and compare both copies, original.equals(copy) should be true, no matter what I do (and when).
In this sense, an instance of the typical Employee class full of getters/setters cannot be guaranteed to equal itself in the face of mutability: equals() will be defined as a field-by-field comparison, so the test can fail.
A BufferedWriter, though, does not represent a value: its equality is defined in terms of being exactly the same object in memory. So it has a 'sound' mutability, unlike Employee's, which makes it apt for wrapping in an agent.
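The distinction can be shown concretely. In this sketch the Employee class is the hypothetical mutable "value" described above (its copy() method stands in for .clone()), while StringBuilder stands in for a BufferedWriter-style identity type:

```java
import java.util.Objects;

// Hypothetical Employee: a mutable class that tries to be a value,
// with the usual field-by-field equals().
class Employee {
    String name;
    Employee(String name) { this.name = name; }
    Employee copy() { return new Employee(name); }
    @Override public boolean equals(Object o) {
        return o instanceof Employee && Objects.equals(name, ((Employee) o).name);
    }
    @Override public int hashCode() { return Objects.hash(name); }
}

public class ValueDemo {
    public static void main(String[] args) {
        Employee original = new Employee("Ann");
        Employee copy = original.copy();
        System.out.println(original.equals(copy));  // true: same fields

        original.name = "Bob";                      // mutation...
        System.out.println(original.equals(copy));  // false: no longer equal to its own copy

        // StringBuilder, like BufferedWriter, keeps identity-based equals:
        StringBuilder sb = new StringBuilder();
        StringBuilder alias = sb;
        sb.append("side effect");                   // mutation cannot break identity equality
        System.out.println(sb.equals(alias));       // true: it is still itself
    }
}
```

The Employee stops being "equal to its past self" the moment a field changes, while the identity-equality type keeps its contract no matter how much internal state mutates.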
I believe that you are right in that from the STM point of view, agent-value mutability wouldn't hurt a lot. But it would in that it'd break Clojure's time model, in which you 'cannot change the past', etc.
On deciding whether a Java class is immutable: it's impossible without diving into the implementation. You don't have to care about this too much, though.
I'd make the following taxonomy of types in Java-land:
Mutable objects which (badly) represent values - Employee, etc. Never wrap them in a Clojure reference type.
Immutable objects which represent values - their immutability is reflected in the doc, or in naming conventions ("EmployeeBuilder"). Safe to wrap in any Clojure reference.
Unmanaged collection types - ArrayList, etc. Avoid except for interop purposes.
Managed reference/collection types - AtomicReference, blocking queues... They play fine with Clojure, dubious to wrap them in Clojure references though.
'IO' types - BufferedWriter, Swing stuff... you don't care about their mutability because they don't represent values at all - you just want them for their side effects. It might make sense to guard them in agents to guarantee access serialization.

The agent value should be immutable because someone can do this:
(def my-agent (agent (BufferedWriter.)))
(.write @my-agent "Hello world")
Which is basically modifying the agent's value (in this case the writer) without going through the agent's control mechanism.
Yes, BufferedWriter is mutable, because by writing to it you can change its internal state. It is like a pointer or reference, not a value.
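The same bypass exists on the Java side. In this sketch an AtomicReference stands in for the agent and a StringBuilder stands in for the mutable writer (both are illustrative substitutions, not what Clojure uses internally):

```java
import java.util.concurrent.atomic.AtomicReference;

public class BypassDemo {
    public static void main(String[] args) {
        // AtomicReference plays the role of the agent; StringBuilder the mutable writer.
        AtomicReference<StringBuilder> agent = new AtomicReference<>(new StringBuilder());

        // The managed route would replace the value via compareAndSet.
        // With a mutable value inside, anyone can skip that route entirely:
        agent.get().append("hello");  // mutates the "current value" in place

        // The reference never observed an update, yet its value changed.
        System.out.println(agent.get());  // hello
    }
}
```

With an immutable value inside, get() hands out something harmless; with a mutable one, every reader holds a write capability.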

Related

What is the real benefit of Immutable Objects

I always hear people saying that it's easier to manage immutable objects when working with multiple threads, since when one thread accesses an immutable object, it doesn't have to worry that another thread is changing it.
Well, what happens if I have an immutable list of all employees in a company and a new employee is hired? In this case the immutable list has to be duplicated, and the new copy of it has to include another employee object. Then the reference to the list of employees should be directed to the new list.
When this scenario happens, the list itself doesn't change, but the reference to this list changes, and therefore the code "sees" different data.
If so, I don't understand why immutable objects make our lives easier when working with multiple threads. What am I missing?
The main problem with concurrent updates of mutable data is that threads may perceive variable values stemming from different versions, i.e. a mixture of old and new values when speaking of a single update, forming an inconsistent state that violates the invariants of these variables.
See, for example, Java's ArrayList. It has an int field holding the current size and a reference to an array whose elements are references to the contained objects. The values of these variables have to fulfill certain invariants, e.g. if the size is non-zero, the array reference is never null and the array length is always greater than or equal to the size. When seeing values from different updates for these variables, these invariants do not hold anymore, so threads may see list contents which never existed in this form, or fail with spurious exceptions reporting an illegal state that should be impossible (like NullPointerException or ArrayIndexOutOfBoundsException).
Note that thread-safe or concurrent data structures only solve the problem regarding the internals of the data structure, so operations do not fail with spurious exceptions anymore (regarding the collection's state; we've not talked about the contained elements' state yet). But operations iterating over these collections, or looking at more than one contained element in any form, are still subject to possibly observing an inconsistent state regarding the contained elements. This also applies to the check-then-act anti-pattern, where an application first checks for a condition (e.g. using contains) before acting upon it (like fetching, adding or removing an element), whereas the condition might change in between.
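The check-then-act hazard can be sketched with a synchronized list: each individual call is thread safe, but the two-step sequence is not (the element name is illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CheckThenAct {
    public static void main(String[] args) {
        List<String> jobs = Collections.synchronizedList(new ArrayList<>());

        // contains() and add() are each atomic on their own, but another
        // thread may add "report" between the check and the act, so
        // duplicates can still slip in under concurrency.
        if (!jobs.contains("report")) {
            jobs.add("report");
        }
        System.out.println(jobs);  // [report]
    }
}
```

Wrapping the collection in synchronization does not make the *sequence* atomic; that requires holding a lock across both calls or using a compound operation the collection itself provides.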
In contrast, a thread working on an immutable data structure may work on an outdated version of it, but all variables belonging to that structure are consistent with each other, reflecting the same version. When performing an update, you don't need to think about exclusion of other threads; it's simply not necessary, as the new data structures are not seen by other threads. The entire task of publishing a new version reduces to the task of publishing the root reference to the new version of your data structure. If you can't stop the other threads processing the old version, the worst thing that can happen is that you may have to repeat the operation using the new data afterwards; in other words, just a performance issue in the worst case.
This works smoothly in programming languages with garbage collection, as these allow the new data structure to refer to old objects, just replacing the changed objects (and their parents), without needing to worry about which objects are still in use and which are not.
Here is an example: (a) we have an immutable list, (b) we have a writer thread that adds elements to the list, and (c) 1000 reader threads that read the list without changing it.
It will work without locks.
If we have more than one writer thread, we will still need a write lock. If we have to remove entries from the list, we will need a read-write lock.
Is it valuable? I don't know.
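The single-writer/many-readers scheme above can be sketched with one volatile field publishing immutable versions (a minimal sketch; the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class EmployeeBoard {
    // The single writer publishes new immutable versions through this one field.
    private volatile List<String> employees = List.of();

    // Called by the one writer thread only: build a new version, then publish it.
    void hire(String name) {
        List<String> next = new ArrayList<>(employees);
        next.add(name);
        employees = List.copyOf(next);  // atomic publication of the new root reference
    }

    // Called by any number of reader threads, with no locks:
    // each reader gets some fully consistent version, possibly a slightly old one.
    List<String> snapshot() {
        return employees;
    }
}
```

Readers never lock because the list they obtain can never change under them; the worst case is reading a version that is one update stale.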

Is Immutability more useful for concurrency?

It's usually said that immutable data structures are more "friendly" for concurrent programming. The explanation is that if a data structure is mutable and one thread modifies it, then another thread will "see" the previous state of the data structure.
Although it's impossible to modify an immutable data structure, if one needs to change it, it's possible to create a new data structure that refers to the "old" one.
In my opinion, this situation is also not thread safe because one thread can access the old data structure and the second one can access the new data structure. If so, why is immutable data structure considered more thread-safe?
The idea here is that you can't change an object once it has been created. Every object that is part of some structure is itself immutable. When you create a new structure and reuse some components of the old structure, you still can't change any internal value of any component that makes up this new structure. You can easily identify each structure by its root component reference.
Of course, you still need to make sure you swap them in a thread-safe fashion; this is usually done using variants of CAS (compare-and-swap) instructions. Or you can use functional programming, whose idiom of functions without side effects (take immutable input, produce a new result) is ideal for thread-safe, multithreaded programming.
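A minimal sketch of that CAS idiom in Java, retrying until no other thread raced the update (the helper name swap is made up, echoing Clojure's swap!):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

public class CasSwap {
    // Retry compareAndSet until no other thread changed the value in between.
    static <T> T swap(AtomicReference<T> ref, UnaryOperator<T> f) {
        while (true) {
            T oldVal = ref.get();
            T newVal = f.apply(oldVal);  // f may run more than once on contention,
                                         // so it must be free of side effects
            if (ref.compareAndSet(oldVal, newVal)) {
                return newVal;
            }
        }
    }

    public static void main(String[] args) {
        AtomicReference<Integer> counter = new AtomicReference<>(0);
        swap(counter, n -> n + 1);
        System.out.println(counter.get());  // 1
    }
}
```

The loop only makes sense because the values are immutable: compareAndSet can tell "someone got here first" purely by comparing root references.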
There are many benefits to immutability over mutability, but that does not mean immutable is always better. Every approach has its benefits and applications. Take a look at this answer for more details on immutability uses. Also check this nicely written answer about mutability benefits in some circumstances.

C++ pattern for multiple "worker classes" that manipulate a large data class

I have a single very large data class (really a struct, honestly), which needs to be manipulated in enough different ways that I don't just want to implement all of the manipulators as member methods of the data class.
Right now, I have the manipulators set up as singletons, or small instantiated classes held by some manager object, and I pass every manipulator a pointer to the data class during initialization. This works, but feels a little sloppy to me.
One complicating issue is that the manipulators have state. One example of manipulator state that could be factored out of the manipulators themselves is thread-safety helpers (mutexes/semaphores), but there are other data members that logically belong to the manipulators so I don't think this problem is going away.
So I'm wondering, is there some design pattern that can provide a cleaner solution for this situation?
A factory pattern could be used, with the factory offering a method taking a pointer or reference to the data and a value (possibly an enumeration) indicating the operation to be performed; it then selects the agent that can perform that operation and asks it to do so.
As for the states: if the agents' states are synchronised, then a single state in the factory would be fine. If they are not, then the factory could simply provide a method to be called whenever anything happens that could change any agent's state, and inform all the agents. Or the agents could themselves be observers of whatever it is that causes the state changes.
As for implementing a state machine - that's also often done using a factory pattern! So you could have a factory of factories, where each sub-factory is also an observer. This would be almost too awesome for words.
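The dispatch-by-enumeration idea above can be sketched as follows (shown in Java rather than C++ for brevity; every name here is invented for illustration):

```java
import java.util.EnumMap;
import java.util.Map;

public class ManipulatorFactory {
    // All names are invented: Data is the large shared struct,
    // Operation enumerates the manipulations, Manipulator is a worker.
    static class Data { int value; }

    enum Operation { SCALE, RESET }

    interface Manipulator { void apply(Data d); }

    private final Map<Operation, Manipulator> workers = new EnumMap<>(Operation.class);

    ManipulatorFactory() {
        workers.put(Operation.SCALE, d -> d.value *= 2);
        workers.put(Operation.RESET, d -> d.value = 0);
    }

    // The factory selects the worker that can perform the requested
    // operation on the shared data and asks it to do so.
    void perform(Data d, Operation op) {
        workers.get(op).apply(d);
    }
}
```

The manager code then only knows the factory and the enumeration; the individual manipulators (and any state they carry) stay hidden behind the perform call.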

how do atoms differ from refs?

How do atoms and refs actual differ?
I understand that atoms are declared differently and are updated via the swap! function, whereas refs use alter inside a dosync. The internal implementation, however, seems quite similar, which makes me wonder why I would use one and not the other.
For example, the doc page for atoms (http://clojure.org/atoms) states:
Internally, swap! reads the current value, applies the function to it, and attempts to compare-and-set it in. Since another thread may have changed the value in the intervening time, it may have to retry, and does so in a spin loop. The net effect is that the value will always be the result of the application of the supplied function to a current value, atomically. However, because the function might be called multiple times, it must be free of side effects.
The method described sounds quite similar to me to the STM used for refs.
The difference is that you can't coordinate changes between multiple atoms but you can coordinate changes between multiple refs.
Ref changes have to take place inside of a dosync block. All of the changes in the dosync take place or none of them do (atomic) but that extends to all changes to the refs within that dosync. This behaves a lot like a database transaction.
Let's say for example that you wanted to remove an item from one collection and add it to another but without anyone seeing a case where neither collection has the item. That's impossible to guarantee with atoms but you can guarantee it with refs.
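The problem can be sketched with two independent AtomicReferences holding immutable sets, standing in for two atoms (an illustrative substitution): each update is atomic on its own, but there is a window between them where the item is in neither set, which is exactly what refs inside a dosync rule out.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

public class MoveItem {
    public static void main(String[] args) {
        // Two independent "atoms", each holding an immutable set.
        AtomicReference<Set<String>> from = new AtomicReference<>(Set.of("item"));
        AtomicReference<Set<String>> to = new AtomicReference<>(Set.of());

        // Atomic removal from the first set...
        from.updateAndGet(s -> {
            Set<String> next = new HashSet<>(s);
            next.remove("item");
            return Set.copyOf(next);
        });
        // ...right here, a reader sees "item" in NEITHER set: the two
        // updates are individually atomic but not coordinated.
        to.updateAndGet(s -> {
            Set<String> next = new HashSet<>(s);
            next.add("item");
            return Set.copyOf(next);
        });

        System.out.println(from.get());  // []
        System.out.println(to.get());    // [item]
    }
}
```

No ordering of the two updates closes the gap; with the other order, a reader could instead see the item in both sets at once.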
Keep in mind that:
Use Refs for synchronous, coordinated and shared changes.
Use Atoms for synchronous, independent and shared changes.
To me, I don't care about the implementation differences between atoms and refs. What I care about is the use cases of each one of them.
I use refs when I need to change the state of more than one reference type and I need the STM semantics. I use atoms when I change the state of one reference type (object, depending on how you see it).
For example, if I need to increase the number of page hits in a web analytics system; I use atoms. If I need to transfer money between two accounts, I use refs.

Notifying a class of another class' changes

Having the classes Container, Item and Property, the container shall be notified whenever a property in an item changes.
The container is the owner of the items and needs the information to properly manage them according to their properties.
I've thought of 2 options yet:
Observer pattern.
Proxy object.
The observer pattern seems to be too heavy for that task in my opinion. A proxy object could work out well, however in that case I'd violate the DRY principle, because I have to forward calls from the proxy to the actual object.
A requirement is that the details are hidden from the user. It must not be necessary to call some update_item() function or similar, i.e. giving the responsibility of informing the container to the user, which might lead to usage problems.
In such a simple case I don't see a reason to use Observer. Since an Item can only be in one container at a time, I would just go with giving the Item a reference or pointer to the container it is placed in.
When some Property of the Item changes, it is able to notify its Container via that pointer.
Observer pattern is useful in case you need to notify many objects.
EDIT
Making every simple thing using interfaces and extremely clean design may also harm you. I think this quote from the Zen of Python explains well what I mean:
Special cases aren't special enough to break the rules. //make Interfaces
Although practicality beats purity. //but not everywhere
So you should have a balance between purity and practicality.
You should use the pattern that is easiest to understand and maintain, and requires the least invention of specialized components. In the environment I work in (Objective-C), the observer pattern is about as uncomplicated as it gets. It also offers flexibility when your notification requirements change.
Andrew's answer is even simpler - direct communication between objects does not require the invention of a proxy or the overhead of notification handling. But it has less flexibility, should you need it in the future.
I'm not sure what "too heavy" means. Perhaps you can explain that.
As has been mentioned before, an Observer is pretty much overkill here, but the solution is pretty simple. You just need to "bubble up" the event.
When a property is changed, it notifies its parent item.
When an item is changed (a side effect of either a property changing, or something more integral to the item), it notifies its container/parent.
When a container is notified, well, you're done. If containers can be nested then I guess it can raise an event to its parent container if necessary.
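The bubble-up chain can be sketched as follows, reusing the question's Container/Item/Property naming (method names are hypothetical):

```java
public class BubbleUp {
    static class Container {
        boolean notified = false;
        void onItemChanged(Item item) {
            notified = true;  // re-sort, re-index, etc. would go here
        }
    }

    static class Item {
        private final Container parent;  // an Item lives in exactly one container
        Item(Container parent) { this.parent = parent; }
        void onPropertyChanged(Property p) {
            parent.onItemChanged(this);  // bubble the event one level up
        }
    }

    static class Property {
        private final Item parent;
        private Object value;
        Property(Item parent) { this.parent = parent; }
        // The user only sets the value; notification happens automatically.
        void set(Object v) {
            value = v;
            parent.onPropertyChanged(this);
        }
    }
}
```

The user-facing requirement is met: calling set() is all the user ever does, and the container learns about the change without any explicit update call.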